DataCamp offers interactive courses on Python programming. R Markdown documents can run simple Python code chunks, although data created in one chunk is not accessible to later chunks (a large difference from R Markdown for R). This document summarizes notes from the first module.
Chapter 1 - Python Basics
Hello Python! - focusing on Python specific to data science:
Variables and Types - variable names are case-sensitive in Python:
Example code includes:
# Example, do not modify!
print(5 / 8)
# Put code below here
print(7 + 10)
# Recall that commented lines are marked by the hash-sign, same as R
# Exponentiation is ** and modulo division is %
# Addition and subtraction
print(5 + 5)
print(5 - 5)
# Multiplication and division
print(3 * 5)
print(10 / 2)
# Exponentiation
print(4 ** 2)
# Modulo
print(18 % 7)
# How much is your $100 worth after 7 years?
print(100 * 1.1**7)
# Create a variable savings
savings = 100
# Print out savings
print(savings)
# Create a variable savings
savings = 100
# Create a variable factor
factor = 1.10
# Calculate result
result = savings * factor ** 7
# Print out result
print(result)
# Create a variable desc
desc = "compound interest"
# Create a variable profitable
profitable = True
# Several variables to experiment with
savings = 100
factor = 1.1
desc = "compound interest"
# Assign product of factor and savings to year1
year1 = savings * factor
# Print the type of year1
print(type(year1))
# Assign sum of desc and desc to doubledesc
doubledesc = desc + desc
# Print out doubledesc
print(doubledesc)
# Definition of savings and result
savings = 100
result = 100 * 1.10 ** 7
# Fix the printout
print("I started with $" + str(savings) + " and now have $" + str(result) + ". Awesome!")
# Definition of pi_string
pi_string = "3.1415926"
# Convert pi_string into float: pi_float
pi_float = float(pi_string)
## 0.625
## 17
## 10
## 0
## 15
## 5.0
## 16
## 4
## 194.87171000000012
## 100
## 194.87171000000012
## <class 'float'>
## compound interestcompound interest
## I started with $100 and now have $194.87171000000012. Awesome!
The output all comes at once, another difference from R Markdown for R. In combination with being unable to access any of the variables later in the same document, there are tangible limitations to this approach.
Using Python within R Markdown may be more useful if I install “feather” for both Python and R. Feather allows running code in Python, then quickly saving pandas DataFrames in a format that can be quickly read as data frames by the next R chunk. See https://blog.rstudio.org/2016/03/29/feather/.
Getting feather for R took just a few seconds using install.packages(). Getting feather for Python 3.6 on Windows seems to require a C++ 14.0 compiler from MS Visual Studio. So far, that is easier said than done.
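Until feather is working, one low-tech way around the chunk isolation described above is to write values to a small file at the end of one chunk and read them back at the start of the next. The following is only a minimal sketch using Python's standard json and tempfile modules; the file name chunk_state.json and the "chunk 1" / "chunk 2" split are assumptions made for illustration, not something taken from the course or from feather.

```python
import json
import os
import tempfile

# Hypothetical hand-off file; any location both chunks can reach would do
state_path = os.path.join(tempfile.gettempdir(), "chunk_state.json")

# "Chunk 1": compute values, then save them before the chunk ends
savings = 100
factor = 1.10
result = savings * factor ** 7
with open(state_path, "w") as f:
    json.dump({"savings": savings, "factor": factor, "result": result}, f)

# "Chunk 2": start with nothing in memory and reload what chunk 1 saved
with open(state_path) as f:
    state = json.load(f)
print(state["savings"], state["result"])
```

This only covers JSON-friendly values (numbers, strings, lists, dictionaries); a data frame would still need something like feather or a CSV file.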
Chapter 2 - Lists
What are lists? Multiple values in one variable, formed using square brackets such as myList = [a, b, c]:
Subsetting lists - the first element in the list is defined as element 0:
List manipulation - changing, adding, or removing elements:
Example code includes:
# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50
# Create list areas
areas = [hall, kit, liv, bed, bath]
# Print areas
print(areas)
# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50
# Adapt list areas
areas = ["hallway", hall, "kitchen", kit, "living room", liv, "bedroom", bed, "bathroom", bath]
# Print areas
print(areas)
# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50
# house information as list of lists
house = [["hallway", hall],
         ["kitchen", kit],
         ["living room", liv],
         ["bedroom", bed],
         ["bathroom", bath]
        ]
# Print out house
print(house)
# Print out the type of house
print(type(house))
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]
# Print out second element from areas
print(areas[1])
# Print out last element from areas
print(areas[-1])
# Print out the area of the living room
print(areas[5])
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]
# Sum of kitchen and bedroom area: eat_sleep_area
eat_sleep_area = areas[3] + areas[7]
# Print the variable eat_sleep_area
print(eat_sleep_area)
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]
# Use slicing to create downstairs
downstairs = areas[:6]
# Use slicing to create upstairs
upstairs = areas[6:]
# Print out downstairs and upstairs
print(downstairs)
print(upstairs)
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]
# Correct the bathroom area
areas[-1] = 10.5
# Change "living room" to "chill zone"
areas[4] = "chill zone"
# Create the areas list and make some changes
areas = ["hallway", 11.25, "kitchen", 18.0, "chill zone", 20.0,
         "bedroom", 10.75, "bathroom", 10.50]
# Add poolhouse data to areas, new list is areas_1
areas_1 = areas + ["poolhouse", 24.5]
# Add garage data to areas_1, new list is areas_2
areas_2 = areas_1 + ["garage", 15.45]
# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
# Create areas_copy
areas_copy = list(areas)
# Change areas_copy
areas_copy[0] = 5.0
# Print areas
print(areas)
## [11.25, 18.0, 20.0, 10.75, 9.5]
## ['hallway', 11.25, 'kitchen', 18.0, 'living room', 20.0, 'bedroom', 10.75, 'bathroom', 9.5]
## [['hallway', 11.25], ['kitchen', 18.0], ['living room', 20.0], ['bedroom', 10.75], ['bathroom', 9.5]]
## <class 'list'>
## 11.25
## 9.5
## 20.0
## 28.75
## ['hallway', 11.25, 'kitchen', 18.0, 'living room', 20.0]
## ['bedroom', 10.75, 'bathroom', 9.5]
## [11.25, 18.0, 20.0, 10.75, 9.5]
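The last printout above is unchanged because list(areas) builds a new list. The short sketch below contrasts that with plain assignment, which only creates a second name for the same list, and also recaps the slicing rule (start included, end excluded). The name areas_alias is introduced here purely for illustration.

```python
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Plain assignment: both names refer to the same list object
areas_alias = areas
areas_alias[0] = 5.0
print(areas[0])          # 5.0, the original changed too

# Explicit copy: list() (or areas[:]) builds a new list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
areas_copy = list(areas)
areas_copy[0] = 5.0
print(areas[0])          # 11.25, the original is untouched

# Slicing: the start index is included, the end index is excluded
print(areas[1:3])        # [18.0, 20.0]
```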
Chapter 3 - Functions and Packages
Introduction to functions - pieces of reusable code for solving a particular task:
Methods - all objects of a specific type have default access to the methods for that object:
Packages are directories of Python scripts, each a module specifying functions, methods, and types:
Example code includes:
# Create variables var1 and var2
var1 = [1, 2, 3, 4]
var2 = True
# Print out type of var1
print(type(var1))
# Print out length of var1
print(len(var1))
# Convert var2 to an integer: out2
out2 = int(var2)
# Create lists first and second
first = [11.25, 18.0, 20.0]
second = [10.75, 9.50]
# Paste together first and second: full
full = first + second
# Sort full in descending order: full_sorted
full_sorted = sorted(full, reverse=True)
# Print out full_sorted
print(full_sorted)
# string to experiment with: room
room = "poolhouse"
# Use upper() on room: room_up
room_up = room.upper()
# Print out room and room_up
print(room)
print(room_up)
# Print out the number of o's in room
print(room.count("o"))
# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
# Print out the index of the element 20.0
print(areas.index(20.0))
# Print out how often 14.5 appears in areas
print(areas.count(14.5))
# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
# Use append twice to add poolhouse and garage size
areas.append(24.5)
areas.append(15.45)
# Print out areas
print(areas)
# Reverse the orders of the elements in areas
areas.reverse()
# Print out areas
print(areas)
# Definition of radius
r = 0.43
# Import the math package
import math
# Calculate C
C = 2 * math.pi * r
# Calculate A
A = math.pi * (r ** 2)
# Build printout
print("Circumference: " + str(C))
print("Area: " + str(A))
# Definition of radius
r = 192500
# Import radians function of math package
from math import radians
# Travel distance of Moon over 12 degrees. Store in dist.
dist = r * radians(12)
# Print out dist
print(dist)
## <class 'list'>
## 4
## [20.0, 18.0, 11.25, 10.75, 9.5]
## poolhouse
## POOLHOUSE
## 3
## 2
## 0
## [11.25, 18.0, 20.0, 10.75, 9.5, 24.5, 15.45]
## [15.45, 24.5, 9.5, 10.75, 20.0, 18.0, 11.25]
## Circumference: 2.701769682087222
## Area: 0.5808804816487527
## 40317.10572106901
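To make the function/method distinction in this chapter concrete, here is a small sketch comparing the built-in function sorted() with the list method sort(), followed by the two import styles used above. The variable names areas_desc, areas_inplace and returned are made up for this illustration.

```python
import math
from math import radians

areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Function: sorted() returns a new list and leaves areas alone
areas_desc = sorted(areas, reverse=True)

# Method: list.sort() reorders the list in place and returns None
areas_inplace = list(areas)
returned = areas_inplace.sort()

print(areas_desc)
print(areas_inplace, returned)

# Two import styles that reach the same function object
print(math.radians is radians)
```

The general import keeps the math. prefix visible at each call, which makes it clearer where a function comes from; the selective import is shorter to type.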
Chapter 4 - Numpy
Numpy extends list operations using “Numerical Python” (collections of values, optimized for speed):
2D Numpy Arrays - extending the vector to be multi-dimensional:
Numpy Basic Statistics - basic data exploration:
Example code includes:
# Create list baseball
baseball = [180, 215, 210, 210, 188, 176, 209, 200]
# Import the numpy package as np
import numpy as np
# Create a Numpy array from baseball: np_baseball
np_baseball = np.array(baseball)
# Print out type of np_baseball
print(type(np_baseball))
# DO NOT HAVE THE HEIGHT OR WEIGHT DATA - it is MLB data on 1000 players
# Create dummy data (values look metric, so the inch/pound conversions below give unrealistic BMIs)
height = np.round(np.random.normal(1.75, 0.20, 5000), 2)
weight = np.round(np.random.normal(60.32, 15, 5000), 2)
# Create a Numpy array from height: np_height
np_height = np.array(height)
# Print out np_height
print(np_height)
# Convert np_height to m: np_height_m
np_height_m = np_height * 0.0254
# Print np_height_m
print(np_height_m)
# Create array from height with correct units: np_height_m
np_height_m = np.array(height) * 0.0254
# Create array from weight with correct units: np_weight_kg
np_weight_kg = np.array(weight) * 0.453592
# Calculate the BMI: bmi
bmi = np_weight_kg / (np_height_m ** 2)
# Print out bmi
print(bmi)
# Calculate the BMI: bmi
np_height_m = np.array(height) * 0.0254
np_weight_kg = np.array(weight) * 0.453592
bmi = np_weight_kg / np_height_m ** 2
# Create the light array
light = bmi < 21
# Print out light
print(light)
# Print out BMIs of all baseball players whose BMI is below 21
print(bmi[light])
# Store weight and height lists as numpy arrays
np_weight = np.array(weight)
np_height = np.array(height)
# Print out the weight at index 50
print(np_weight[50])
# Print out sub-array of np_height: index 100 up to and including index 110
print(np_height[100:111])
# Create baseball, a list of lists
baseball = [[180, 78.4],
            [215, 102.7],
            [210, 98.5],
            [188, 75.2]]
# Import numpy
import numpy as np
# Create a 2D Numpy array from baseball: np_baseball
np_baseball = np.array(baseball)
# Print out the type of np_baseball
print(type(np_baseball))
# Print out the shape of np_baseball
print(np_baseball.shape)
# DO NOT HAVE baseball, which is a list of lists of the 1015 MLB players with their height/weight
# Create a 2D Numpy array from baseball: np_baseball
# np_baseball = np.array(baseball)
# Dummy up the data instead
np_baseball = np.column_stack((height, weight))
# Print out the shape of np_baseball
print(np_baseball.shape) # 1015 x 2 in the course data; 5000 x 2 with the dummy data
# Create np_baseball (2 cols)
# np_baseball = np.array(baseball)
# Print out the 50th row of np_baseball
print(np_baseball[49])
# Select the entire second column of np_baseball: np_weight
np_weight = np_baseball[:, 1]
# Print out height of 124th player
print(np_baseball[123, 0])
# DO NOT HAVE baseball OR updated ; each should be 1,015 x 3 (height, weight, bmi)
# Create np_baseball (3 cols)
# np_baseball = np.array(baseball)
# Print out addition of np_baseball and updated
# print(np_baseball + updated)
# Create Numpy array: conversion
# conversion = np.array([0.0254, 0.453592, 1])
# Print out product of np_baseball and conversion
# print(np_baseball * conversion)
# Create np_height from np_baseball
np_height = np_baseball[:, 0]
# Print out the mean of np_height
print(np.mean(np_height))
# Print out the median of np_height
print(np.median(np_height))
# Print mean height (first column)
avg = np.mean(np_baseball[:,0])
print("Average: " + str(avg))
# Print median height
med = np.median(np_baseball[:,0])
print("Median: " + str(med))
# Print out the standard deviation on height
stddev = np.std(np_baseball[:,0])
print("Standard Deviation: " + str(stddev))
# Print out correlation between first and second column
corr = np.corrcoef(np_baseball[:, 0], np_baseball[:, 1])
print("Correlation: " + str(corr))
# DO NOT HAVE DATA for positions or heights (soccer data . . . )
# Convert positions and heights to numpy arrays: np_positions, np_heights
# np_positions = np.array(positions)
# np_heights = np.array(heights)
# Heights of the goalkeepers: gk_heights
# gk_heights = np_heights[np_positions == "GK"]
# Heights of the other players: other_heights
# other_heights = np_heights[np_positions != "GK"]
# Print out the median height of goalkeepers
# print("Median height of goalkeepers: " + str(np.median(gk_heights)))
# Print out the median height of other players
# print("Median height of other players: " + str(np.median(other_heights)))
## <class 'numpy.ndarray'>
## [ 1.99 1.67 1.62 ..., 1.86 1.71 1.58]
## [ 0.050546 0.042418 0.041148 ..., 0.047244 0.043434 0.040132]
## [ 12823.58382701 19121.4400357 16397.97815337 ..., 13315.14663029
## 10182.61085802 14264.71926009]
## [False False False ..., False False False]
## []
## 63.13
## [ 1.74 1.69 1.52 1.87 1.86 1.92 1.36 1.88 2. 1.92 1.36]
## <class 'numpy.ndarray'>
## (4, 2)
## (5000, 2)
## [ 1.46 81.35]
## 1.62
## 1.74487
## 1.75
## Average: 1.74487
## Median: 1.75
## Standard Deviation: 0.197347670622
## Correlation: [[ 1.00000000e+00 2.09381375e-04]
## [ 2.09381375e-04 1.00000000e+00]]
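Because the real height/weight data were not available, the BMI values above are not meaningful. The sketch below replays the same steps (column-wise unit conversion by broadcasting, element-wise arithmetic, boolean masking) on three made-up rows that really are in inches and pounds. The array np_players and its values are invented for illustration only.

```python
import numpy as np

# Made-up rows of (height in inches, weight in pounds), for illustration only
np_players = np.array([[74.0, 180.0],
                       [72.0, 215.0],
                       [69.0, 140.0]])

# Broadcasting: each column is multiplied by its own conversion factor
conversion = np.array([0.0254, 0.453592])
np_metric = np_players * conversion

np_height_m = np_metric[:, 0]
np_weight_kg = np_metric[:, 1]

# Element-wise arithmetic over whole arrays, no loop needed
bmi = np_weight_kg / np_height_m ** 2

# Boolean mask, then subsetting with the mask
light = bmi < 21
print(np.round(bmi, 1))
print(bmi[light])
```

With sensible units the mask selects at least one row, unlike the empty array printed for the dummy data.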
Chapter 1 - Matplotlib for Data Visualization
Basic plots with matplotlib - generally, the heart of visualization within Python:
Histograms are useful for exploring a dataset (getting an idea about the distribution):
Customization for changing the base plot types in Python:
Example code includes:
# Define the reading data path
readPath = "C:/Users/Dave/Documents/Personal/Learning/Coursera/RDirectory/RHomework/DataCamp/"
# This is world population 1950-2100 (DO NOT HAVE FILE)
# Import some Wikipedia data from CSV as a pandas DataFrame
import pandas as pd
globalPop = pd.read_csv(readPath + "GlobalPopYear_1950_2100_v001.csv")
year = globalPop["year"]
pop = globalPop["pop"]
# Print the last item from year and pop
print(year.iloc[-1])
print(pop.iloc[-1])
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Make a line plot: year on the x-axis, pop on the y-axis
plt.plot(year, pop)
# Display the plot with plt.show()
# Need to use a proper Python IDE for plt.show() - otherwise just pops up the images "live"
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy001.png", bbox_inches="tight")
## 2100
## 11000000002
The population plot saved from Python is:
Next, the Hans Rosling Data is explored:
# Using the Hans Rosling Data (2007 life expectancy and GDP for 142 countries)
# Create from Wikipedia, World Bank, and the like
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# readPath = "C:\\Users\\Dave\\Documents\\Personal\\Learning\\Coursera\\RDirectory\\RHomework\\DataCamp\\"
readPath = "C:/Users/Dave/Documents/Personal/Learning/Coursera/RDirectory/RHomework/DataCamp/"
globalData = pd.read_csv(readPath + "GlobalGDPLifeExpectancy_v001.csv")
gdp_cap = 1000000 * np.array(globalData["gdp"]) / np.array(globalData["pop"])
life_exp = globalData["le_2015"]
pop = globalData["pop"]
life_exp1950 = globalData["le_1960"] # Much easier to get 1960 than 1950 online - KLUGE
regn = globalData["region"]
# Print the last item of gdp_cap and life_exp
print(gdp_cap[-1]) # Since it is a numpy array
print(life_exp.iloc[-1]) # Since it is a pandas Series
# Make a line plot, gdp_cap on the x-axis, life_exp on the y-axis
plt.plot(gdp_cap, life_exp)
# Display the plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy002.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Change the line plot below to a scatter plot
plt.scatter(gdp_cap, life_exp)
# Put the x-axis on a logarithmic scale
plt.xscale('log')
# Show plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy003.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Brings in yet another variable, population
# Build Scatter plot
plt.scatter(pop, life_exp)
plt.xscale("log")
# Show plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy004.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Create histogram of life_exp data
plt.hist(life_exp)
# Display histogram
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy005.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Build histogram with 5 bins
plt.hist(life_exp, bins=5)
# Show and clean up plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# plt.clf()
# Save as dummy PNG instead
plt.savefig("_dummyPy006.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Build histogram with 20 bins
plt.hist(life_exp, bins=20)
# Show and clean up again
# Need to use a proper Python IDE for plt.show()
# plt.show()
# plt.clf()
# Save as dummy PNG instead
plt.savefig("_dummyPy007.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Histogram of life_exp, 15 bins
plt.hist(life_exp, bins=15)
# Show and clear plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# plt.clf()
# Save as dummy PNG instead
plt.savefig("_dummyPy008.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Histogram of life_exp1950, 15 bins
plt.hist(life_exp1950, bins=15)
# Show and clear plot again
# Need to use a proper Python IDE for plt.show()
# plt.show()
# plt.clf()
# Save as dummy PNG instead
plt.savefig("_dummyPy009.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Basic scatter plot, log scale
plt.scatter(gdp_cap, life_exp)
plt.xscale('log')
# Strings
xlab = 'GDP per Capita [in USD]'
ylab = 'Life Expectancy [in years]'
title = 'World Development in 2007'
# Add axis labels
plt.xlabel(xlab)
plt.ylabel(ylab)
# Add title
plt.title(title)
# After customizing, display the plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy010.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Scatter plot
plt.scatter(gdp_cap, life_exp)
# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
# Definition of tick_val and tick_lab
tick_val = [1000,10000,100000]
tick_lab = ['1k','10k','100k']
# Adapt the ticks on the x-axis
plt.xticks(tick_val, tick_lab)
# After customizing, display the plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy011.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Import numpy as np
import numpy as np
# Store pop as a numpy array: np_pop
np_pop = np.array(pop) / 1000000 # Population in millions
# Double np_pop
np_pop = np_pop * 2 # Doubled for larger bubbles
# Update: set s argument to np_pop
plt.scatter(gdp_cap, life_exp, s = np_pop)
# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000, 10000, 100000],['1k', '10k', '100k'])
# Display the plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy012.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Color is based on continent, using the below dictionary
colDict = {
    'Asia':'red',
    'Europe':'green',
    'Africa':'blue',
    'Americas':'yellow',
    'Oceania':'black'
}
col=[]
for eachRegion in regn :
    col.append(colDict[eachRegion])
# Specify c and alpha inside plt.scatter()
plt.scatter(x = gdp_cap, y = life_exp, s = np_pop , c=col, alpha=0.8)
# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000,10000,100000], ['1k','10k','100k'])
# Show the plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# plt.clf()
# Save as dummy PNG instead
plt.savefig("_dummyPy013.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Scatter plot
plt.scatter(x = gdp_cap, y = life_exp, s = np_pop, c = col, alpha = 0.4)
# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000,10000,100000], ['1k','10k','100k'])
# Additional customizations
plt.text(1550, 71, 'India')
plt.text(5700, 80, 'China')
# Add grid() call
plt.grid(True)
# Show the plot
# Need to use a proper Python IDE for plt.show()
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy014.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
## 888.906425266
## 59.2
GDP vs Life Expectancy by Country as Line Graph (not good . . . ):
GDP vs Life Expectancy by Country as Scatter Plot:
GDP vs Life Expectancy by Country as Scatter Plot with Log Scale:
Life Expectancy Histogram (default 10 bins):
Life Expectancy Histogram (5 bins):
Life Expectancy Histogram (20 bins):
Life Expectancy Histogram for 2015 (15 bins):
Life Expectancy Histogram for 1960 (15 bins):
Base Rosling-like graph (GDP vs Life Expectancy by Country Scatter):
Rosling-like graph (enhanced tick labels):
Rosling-like graph (bubble size ~ population):
Rosling-like graph (bubble color based on region):
Rosling-like graph (semi-transparent bubbles):
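The save-then-clear pattern used throughout the plotting code can also be written with an explicit figure object, which avoids having to remember plt.clf() after every plot. The following is a minimal sketch, assuming matplotlib's non-interactive "Agg" backend is acceptable (no window is opened). The data values and the output file name _sketch_rosling.png are made up for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")          # non-interactive backend: no window is opened
import matplotlib.pyplot as plt
import numpy as np

# Made-up values standing in for GDP per capita, life expectancy, population
gdp_cap = np.array([900.0, 4500.0, 12000.0, 43000.0])
life_exp = np.array([59.0, 68.0, 74.0, 81.0])
pop_millions = np.array([30.0, 1300.0, 200.0, 60.0])

# Each figure is its own object, so plots cannot draw over one another
fig, ax = plt.subplots()
ax.scatter(gdp_cap, life_exp, s=pop_millions, alpha=0.8)
ax.set_xscale("log")
ax.set_xlabel("GDP per Capita [in USD]")
ax.set_ylabel("Life Expectancy [in years]")
ax.set_xticks([1000, 10000, 100000])
ax.set_xticklabels(["1k", "10k", "100k"])

# Save to a temporary location instead of calling plt.show()
out_path = os.path.join(tempfile.gettempdir(), "_sketch_rosling.png")
fig.savefig(out_path, bbox_inches="tight")
plt.close(fig)                 # release the figure once it has been saved
```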
Chapter 2 - Dictionaries and Pandas
Dictionaries, Part I - key-value pairs:
Dictionaries, Part II:
Pandas, Part I - tabular dataset storage and manipulation:
Pandas, Part II - indexing and selecting data from a DataFrame using square brackets, loc, and iloc:
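As a quick sketch of the three selection styles just listed, the block below builds a three-row stand-in for the cars data and selects from it with square brackets, loc (by label) and iloc (by integer position). The column order is fixed explicitly with columns= so that the iloc positions are predictable; the variable names are made up for illustration.

```python
import pandas as pd

# Small stand-in for the cars data used in the course
cars = pd.DataFrame(
    {"country": ["United States", "Australia", "Japan"],
     "drives_right": [True, False, False],
     "cars_per_cap": [809, 731, 588]},
    columns=["country", "drives_right", "cars_per_cap"],
    index=["US", "AUS", "JAP"],
)

# Square brackets: one column name gives a Series, a list of names a DataFrame
country_series = cars["country"]
country_frame = cars[["country"]]

# loc selects by row/column label, iloc by integer position
by_label = cars.loc["JAP", "cars_per_cap"]
by_position = cars.iloc[2, 2]

print(type(country_series).__name__, type(country_frame).__name__)
print(by_label, by_position)
```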
Example code includes:
# Definition of countries and capital
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']
# Get index of 'germany': ind_ger
ind_ger = countries.index("germany")
# Use ind_ger to print out capital of Germany
print(capitals[ind_ger])
# Definition of countries and capital
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']
# From string in countries and capitals, create dictionary europe
europe = {
    'spain':'madrid',
    'france':'paris',
    'germany':'berlin',
    'norway':'oslo'
}
# Print europe
print(europe)
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }
# Print out the keys in europe
print(europe.keys())
# Print out value that belongs to key 'norway'
print(europe['norway'])
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }
# Add italy to europe
europe['italy'] = 'rome'
# Print out italy in europe
print('italy' in europe)
# Add poland to europe
europe['poland'] = 'warsaw'
# Print europe
print(europe)
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw',
          'australia':'vienna' }
# Update capital of germany
europe['germany'] = 'berlin'
# Remove australia
del europe['australia']
# Print europe
print(europe)
# Dictionary of dictionaries
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }
# Print out the capital of France
print(europe['france']['capital'])
# Create sub-dictionary data
data = { 'capital':'rome', 'population':59.83 }
# Add data to europe under key 'italy'
europe['italy'] = data
# Print europe
print(europe)
# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
# Import pandas as pd
import pandas as pd
# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = { 'country': names, 'drives_right': dr, 'cars_per_cap': cpc }
# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)
# Print cars
print(cars)
# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
cars_dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }  # avoids shadowing the built-in dict
cars = pd.DataFrame(cars_dict)
print(cars)
# Definition of row_labels
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']
# Specify row labels of cars
cars.index = row_labels
# Print cars again
print(cars)
# DO NOT HAVE FILE "cars.csv" - cars_per_cap , country , drives_right
# Created as cars.to_csv("cars.csv")
# Import the cars.csv data: cars
cars = pd.read_csv("cars.csv")
# Print out cars
print(cars)
# SLIGHTLY DIFFERENT VERSION WITH ROW NAMES AS THE FIRST COLUMN
# Import pandas as pd
import pandas as pd
# Fix import by including index_col
cars = pd.read_csv('cars.csv', index_col=0)
# Print out cars
print(cars)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out country column as Pandas Series
print(cars["country"])
# Print out country column as Pandas DataFrame
print(cars[["country"]])
# Print out DataFrame with country and drives_right columns
print(cars[["country", "drives_right"]])
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out first 3 observations
print(cars[0:3])
# Print out fourth, fifth and sixth observation
print(cars[3:6])
# Print out observation for Japan
print(cars.loc["JAP"])
# Print out observations for Australia and Egypt
print(cars.loc[["AUS", "EG"]])
# Print out drives_right value of Morocco
print(cars.loc[["MOR"], ["drives_right"]])
# Print sub-DataFrame
print(cars.loc[["RU", "MOR"], ["country", "drives_right"]])
# Print out drives_right column as Series
print(cars.loc[:, "drives_right"])
# Print out drives_right column as DataFrame
print(cars.loc[:, ["drives_right"]])
# Print out cars_per_cap and drives_right as DataFrame
print(cars.loc[:, ["cars_per_cap", "drives_right"]])
## berlin
## {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo'}
## dict_keys(['spain', 'france', 'germany', 'norway'])
## oslo
## True
## {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo', 'italy': 'rome', 'poland': 'warsaw'}
## {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo', 'italy': 'rome', 'poland': 'warsaw'}
## paris
## {'spain': {'capital': 'madrid', 'population': 46.77}, 'france': {'capital': 'paris', 'population': 66.03}, 'germany': {'capital': 'berlin', 'population': 80.62}, 'norway': {'capital': 'oslo', 'population': 5.084}, 'italy': {'capital': 'rome', 'population': 59.83}}
## cars_per_cap country drives_right
## 0 809 United States True
## 1 731 Australia False
## 2 588 Japan False
## 3 18 India False
## 4 200 Russia True
## 5 70 Morocco True
## 6 45 Egypt True
## cars_per_cap country drives_right
## 0 809 United States True
## 1 731 Australia False
## 2 588 Japan False
## 3 18 India False
## 4 200 Russia True
## 5 70 Morocco True
## 6 45 Egypt True
## cars_per_cap country drives_right
## US 809 United States True
## AUS 731 Australia False
## JAP 588 Japan False
## IN 18 India False
## RU 200 Russia True
## MOR 70 Morocco True
## EG 45 Egypt True
## Unnamed: 0 cars_per_cap country drives_right
## 0 US 809 United States True
## 1 AUS 731 Australia False
## 2 JAP 588 Japan False
## 3 IN 18 India False
## 4 RU 200 Russia True
## 5 MOR 70 Morocco True
## 6 EG 45 Egypt True
## cars_per_cap country drives_right
## US 809 United States True
## AUS 731 Australia False
## JAP 588 Japan False
## IN 18 India False
## RU 200 Russia True
## MOR 70 Morocco True
## EG 45 Egypt True
## US United States
## AUS Australia
## JAP Japan
## IN India
## RU Russia
## MOR Morocco
## EG Egypt
## Name: country, dtype: object
## country
## US United States
## AUS Australia
## JAP Japan
## IN India
## RU Russia
## MOR Morocco
## EG Egypt
## country drives_right
## US United States True
## AUS Australia False
## JAP Japan False
## IN India False
## RU Russia True
## MOR Morocco True
## EG Egypt True
## cars_per_cap country drives_right
## US 809 United States True
## AUS 731 Australia False
## JAP 588 Japan False
## cars_per_cap country drives_right
## IN 18 India False
## RU 200 Russia True
## MOR 70 Morocco True
## cars_per_cap 588
## country Japan
## drives_right False
## Name: JAP, dtype: object
## cars_per_cap country drives_right
## AUS 731 Australia False
## EG 45 Egypt True
## drives_right
## MOR True
## country drives_right
## RU Russia True
## MOR Morocco True
## US True
## AUS False
## JAP False
## IN False
## RU True
## MOR True
## EG True
## Name: drives_right, dtype: bool
## drives_right
## US True
## AUS False
## JAP False
## IN False
## RU True
## MOR True
## EG True
## cars_per_cap drives_right
## US 809 True
## AUS 731 False
## JAP 588 False
## IN 18 False
## RU 200 True
## MOR 70 True
## EG 45 True
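Returning to the dictionary examples at the start of this chapter's code: the europe dictionary was typed out by hand from the two lists, but it can also be built directly from them. The sketch below uses the built-in zip() and dict() for that, and adds the .get() method as a safe lookup; neither appears in the course code above, so treat this as a side note.

```python
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# Pair the two lists element by element and build the dictionary in one step
europe = dict(zip(countries, capitals))

# Lookup by key replaces the index() detour through two lists
print(europe['germany'])

# .get() returns a default instead of raising KeyError for a missing key
print(europe.get('italy', 'unknown'))

# Adding and deleting keys
europe['italy'] = 'rome'
del europe['norway']
print('italy' in europe, 'norway' in europe)
```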
Chapter 3 - Logic, Control Flow, and Filtering
Comparison Operators - how two values relate (tests for equality, greater, lesser, etc.):
Boolean operators - most commonly used are and, or, and not:
If, elif, else:
Filtering Pandas DataFrame - generally a three-step process of 1) select the key column as a pandas Series, 2) run a test on it, and 3) use the result to grab the relevant rows:
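The three filtering steps can be sketched on an inline copy of the cars data, so that no CSV file is needed. This version combines the two tests with the & operator, which for boolean Series does the same job as np.logical_and; the parentheses are required because & binds more tightly than the comparisons.

```python
import pandas as pd

# Inline copy of the cars data so the example does not depend on cars.csv
cars = pd.DataFrame(
    {"country": ["United States", "Australia", "Japan", "India",
                 "Russia", "Morocco", "Egypt"],
     "drives_right": [True, False, False, False, True, True, True],
     "cars_per_cap": [809, 731, 588, 18, 200, 70, 45]},
    index=["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"],
)

# Step 1: select the key column as a Series
cpc = cars["cars_per_cap"]

# Step 2: run the test, which gives a boolean Series
between = (cpc > 100) & (cpc < 500)

# Step 3: use the boolean Series to grab the matching rows
medium = cars[between]
print(medium)
```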
Example code includes:
# Comparison of booleans
print(True == False)
# Comparison of integers
print((-5 * 15) != 75)
# Comparison of strings
print("pyscript" == "PyScript")
# Compare a boolean with an integer
print(True == 1)
# Comparison of integers
x = -3 * 6
print(x >= -10)
# Comparison of strings
y = "test"
print("test" <= y)
# Comparison of booleans
print(True > False)
# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])
# my_house greater than or equal to 18
print(my_house >= 18)
# my_house less than your_house
print(my_house < your_house)
# Define variables
my_kitchen = 18.0
your_kitchen = 14.0
# my_kitchen bigger than 10 and smaller than 18?
print(my_kitchen > 10 and my_kitchen < 18)
# my_kitchen smaller than 14 or bigger than 17?
print(my_kitchen < 14 or my_kitchen > 17)
# Double my_kitchen smaller than triple your_kitchen?
print(2 * my_kitchen < 3 * your_kitchen)
# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])
# my_house greater than 18.5 or smaller than 10
print(np.logical_or(my_house > 18.5, my_house < 10))
# Both my_house and your_house smaller than 11
print(np.logical_and(my_house <11, your_house < 11))
# Define variables
room = "kit"
area = 14.0
# if statement for room
if room == "kit" :
    print("looking around in the kitchen.")
# if statement for area
if area > 15 :
    print("big place!")
# Define variables
room = "kit"
area = 14.0
# if-else construct for room
if room == "kit" :
    print("looking around in the kitchen.")
else :
    print("looking around elsewhere.")
# if-else construct for area
if area > 15 :
    print("big place!")
else :
    print("pretty small.")
# Define variables
room = "bed"
area = 14.0
# if-elif-else construct for room
if room == "kit" :
    print("looking around in the kitchen.")
elif room == "bed":
    print("looking around in the bedroom.")
else :
    print("looking around elsewhere.")
# if-elif-else construct for area
if area > 15 :
    print("big place!")
elif area > 10 :
    print("medium size, nice!")
else :
    print("pretty small.")
# AS PER ABOVE, DO NOT HAVE THIS DATASET
# That has since been worked around . . .
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Extract drives_right column as Series: dr
dr = cars["drives_right"]
# Use dr to subset cars: sel
sel = cars[dr]
# Print sel
print(sel)
# Convert code to a one-liner
sel = cars[cars['drives_right']]
# Print sel
print(sel)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars["cars_per_cap"]
many_cars = cpc > 500
car_maniac = cars[many_cars]
# Print car_maniac
print(car_maniac)
# Create medium: observations with cars_per_cap between 100 and 500
cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 100, cpc < 500)
medium = cars[between]
# Print medium
print(medium)
## False
## True
## False
## True
## False
## True
## True
## [ True True False False]
## [False True True False]
## False
## True
## True
## [False True False True]
## [False False False True]
## looking around in the kitchen.
## looking around in the kitchen.
## pretty small.
## looking around in the bedroom.
## medium size, nice!
##      cars_per_cap        country  drives_right
## US            809  United States          True
## RU            200         Russia          True
## MOR            70        Morocco          True
## EG             45          Egypt          True
##      cars_per_cap        country  drives_right
## US            809  United States          True
## RU            200         Russia          True
## MOR            70        Morocco          True
## EG             45          Egypt          True
##      cars_per_cap        country  drives_right
## US            809  United States          True
## AUS           731      Australia         False
## JAP           588          Japan         False
##     cars_per_cap  country  drives_right
## RU           200   Russia          True
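Since cars.csv is not bundled with these notes, the filtering steps can be tried on a small frame built inline. The following is a minimal sketch, not the course's own code: the data values are copied from the output printed in these notes, and the `&` operator on boolean Series is shown as an alternative to `np.logical_and`.

```python
import pandas as pd

# Inline stand-in for cars.csv; values copied from the printed output in these notes
cars = pd.DataFrame(
    {
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
    },
    index=["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"],
)

# Boolean Series combine element-wise with & (and) and | (or);
# the parentheses are needed because & binds tighter than the comparisons
cpc = cars["cars_per_cap"]
medium = cars[(cpc > 100) & (cpc < 500)]
print(medium)

# A boolean column can be used directly as a row mask
right_side = cars[cars["drives_right"]]
print(list(right_side.index))
```

Plain `and` / `or` would not work here, because they expect a single truth value rather than one value per row.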
Chapter 4 - Loops
The while loop - alternative to the if/elif/else process:
The for loop - alternative to the while loop:
Looping data structures - Part I - extension to dictionaries, numpy arrays, and the like:
Looping data structures - Part II - extension to pandas DataFrame:
Example code includes:
# Initialize offset
offset = 8
# Code the while loop
while offset != 0:
    print("correcting...")
    offset = offset - 1
    print(offset)
# Initialize offset
offset = -6
# Code the while loop
while offset != 0:
    print("correcting...")
    if offset > 0:
        offset = offset - 1
    else:
        offset = offset + 1
    print(offset)
# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
# Code the for loop
for x in areas:
    print(x)
# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
# Change for loop to use enumerate()
for a, b in enumerate(areas):
    print("room " + str(a) + ": " + str(b))
# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
# Code the for loop
for index, area in enumerate(areas):
    print("room " + str(index + 1) + ": " + str(area))
# house list of lists
house = [["hallway", 11.25],
         ["kitchen", 18.0],
         ["living room", 20.0],
         ["bedroom", 10.75],
         ["bathroom", 9.50]]
# Build a for loop from scratch
for rooms in house:
    print("the " + str(rooms[0]) + " is " + str(rooms[1]) + " sqm")
# Definition of dictionary (entries kept as given; note that Germany's capital
# is Berlin, and Vienna is the capital of Austria, not Australia)
europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'australia':'vienna' }
# Iterate over europe
for country, capital in europe.items():
    print("the capital of " + str(country) + " is " + str(capital))
# Import numpy as np
import numpy as np
# DO NOT HAVE EITHER DATASET
# Create np_height
height = np.round(np.random.normal(1.75, 0.20, 50), 2)
np_height = np.array(height)
# Create np_baseball
# baseball = [180, 215, 210, 210, 188, 176, 209, 200]
# np_baseball = np.array(baseball)
weight = np.round(np.random.normal(60.32, 15, 50), 2)
np_baseball = np.column_stack((height, weight))
# For loop over np_height
# (the simulated values are metre-scale; the "inches" label is carried over from the original exercise)
for height in np_height:
    print(str(height) + " inches")
# The end= argument overrides the default of moving to a new line
# For loop over np_baseball
for item in np.nditer(np_baseball):
    print(item, end=" ")
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Iterate over rows of cars
for lab, dat in cars.iterrows():
    print(lab)
    print(dat)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Adapt for loop
for lab, row in cars.iterrows():
    print(lab + ": " + str(row['cars_per_cap']))
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Code for loop that adds COUNTRY column
for lab, row in cars.iterrows():
    cars.loc[lab, "COUNTRY"] = row['country'].upper()
# Print cars
print(cars)
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Use .apply(str.upper)
cars["COUNTRY"] = cars["country"].apply(str.upper)
print(cars)
## correcting...
## 7
## correcting...
## 6
## correcting...
## 5
## correcting...
## 4
## correcting...
## 3
## correcting...
## 2
## correcting...
## 1
## correcting...
## 0
## correcting...
## -5
## correcting...
## -4
## correcting...
## -3
## correcting...
## -2
## correcting...
## -1
## correcting...
## 0
## 11.25
## 18.0
## 20.0
## 10.75
## 9.5
## room 0: 11.25
## room 1: 18.0
## room 2: 20.0
## room 3: 10.75
## room 4: 9.5
## room 1: 11.25
## room 2: 18.0
## room 3: 20.0
## room 4: 10.75
## room 5: 9.5
## the hallway is 11.25 sqm
## the kitchen is 18.0 sqm
## the living room is 20.0 sqm
## the bedroom is 10.75 sqm
## the bathroom is 9.5 sqm
## the capital of spain is madrid
## the capital of france is paris
## the capital of germany is bonn
## the capital of norway is oslo
## the capital of italy is rome
## the capital of poland is warsaw
## the capital of australia is vienna
## 1.86 inches
## 1.83 inches
## 2.16 inches
## 2.04 inches
## 1.74 inches
## 1.74 inches
## 1.66 inches
## 1.48 inches
## 1.67 inches
## 1.89 inches
## 1.44 inches
## 2.0 inches
## 1.61 inches
## 1.73 inches
## 1.72 inches
## 1.79 inches
## 1.98 inches
## 1.76 inches
## 1.93 inches
## 1.84 inches
## 2.0 inches
## 1.59 inches
## 1.7 inches
## 1.69 inches
## 1.67 inches
## 1.77 inches
## 1.84 inches
## 1.63 inches
## 1.72 inches
## 2.05 inches
## 1.65 inches
## 1.71 inches
## 1.97 inches
## 1.67 inches
## 1.75 inches
## 1.88 inches
## 1.6 inches
## 1.45 inches
## 1.68 inches
## 1.79 inches
## 1.8 inches
## 1.84 inches
## 1.68 inches
## 1.79 inches
## 1.59 inches
## 1.99 inches
## 1.47 inches
## 1.52 inches
## 1.61 inches
## 1.84 inches
## 1.86 63.56 1.83 48.52 2.16 70.28 2.04 47.14 1.74 42.33 1.74 45.11 1.66 43.91 1.48 63.9 1.67 47.42 1.89 71.32 1.44 42.63 2.0 64.87 1.61 61.82 1.73 27.58 1.72 59.01 1.79 76.09 1.98 42.51 1.76 57.69 1.93 78.88 1.84 65.5 2.0 55.99 1.59 52.06 1.7 55.71 1.69 33.15 1.67 58.68 1.77 44.23 1.84 62.31 1.63 82.69 1.72 72.74 2.05 47.59 1.65 51.78 1.71 84.78 1.97 68.52 1.67 84.83 1.75 71.82 1.88 29.46 1.6 44.87 1.45 52.97 1.68 56.7 1.79 74.7 1.8 79.35 1.84 45.74 1.68 75.53 1.79 53.12 1.59 57.87 1.99 59.38 1.47 87.2 1.52 40.49 1.61 54.79 1.84 74.39
## US
## cars_per_cap 809
## country United States
## drives_right True
## Name: US, dtype: object
## AUS
## cars_per_cap 731
## country Australia
## drives_right False
## Name: AUS, dtype: object
## JAP
## cars_per_cap 588
## country Japan
## drives_right False
## Name: JAP, dtype: object
## IN
## cars_per_cap 18
## country India
## drives_right False
## Name: IN, dtype: object
## RU
## cars_per_cap 200
## country Russia
## drives_right True
## Name: RU, dtype: object
## MOR
## cars_per_cap 70
## country Morocco
## drives_right True
## Name: MOR, dtype: object
## EG
## cars_per_cap 45
## country Egypt
## drives_right True
## Name: EG, dtype: object
## US: 809
## AUS: 731
## JAP: 588
## IN: 18
## RU: 200
## MOR: 70
## EG: 45
##      cars_per_cap        country  drives_right        COUNTRY
## US            809  United States          True  UNITED STATES
## AUS           731      Australia         False      AUSTRALIA
## JAP           588          Japan         False          JAPAN
## IN             18          India         False          INDIA
## RU            200         Russia          True         RUSSIA
## MOR            70        Morocco          True        MOROCCO
## EG             45          Egypt          True          EGYPT
##      cars_per_cap        country  drives_right        COUNTRY
## US            809  United States          True  UNITED STATES
## AUS           731      Australia         False      AUSTRALIA
## JAP           588          Japan         False          JAPAN
## IN             18          India         False          INDIA
## RU            200         Russia          True         RUSSIA
## MOR            70        Morocco          True        MOROCCO
## EG             45          Egypt          True          EGYPT
Chapter 5 - Case Study: Hacker Statistics
Random numbers - random walk using a 6-sided die where 1/2 means -1, 3/4/5 means +1, and 6 means roll again and go up by the number shown on the next roll:
Random walk - well-known pattern in science:
Distribution of random walks - expanding on the 100-step random walk:
Example code includes:
# Import numpy as np
import numpy as np
# Set the seed
np.random.seed(123)
# Generate and print random float
print(np.random.rand())
# Import numpy and set seed
import numpy as np
np.random.seed(123)
# Use randint() to simulate a dice
print(np.random.randint(1, 7))
# Use randint() again
print(np.random.randint(1, 7))
# Import numpy and set seed
import numpy as np
np.random.seed(123)
# Starting step
step = 50
# Roll the dice
dice = np.random.randint(1, 7)
# Finish the control construct
if dice <= 2:
    step = step - 1
elif dice < 6:
    step = step + 1
else:
    step = step + np.random.randint(1,7)
# Print out dice and step
print(dice)
print(step)
# Import numpy and set seed
import numpy as np
np.random.seed(123)
# Initialize random_walk
random_walk = [0]
# Simulate 100 steps of the walk
for x in range(100):
    # Set step: last element in random_walk
    step = random_walk[-1]
    # Roll the dice
    dice = np.random.randint(1,7)
    # Determine next step
    if dice <= 2:
        step = step - 1
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)
    # Append step to random_walk
    random_walk.append(step)
# Print random_walk
print(random_walk)
# Import numpy and set seed
import numpy as np
np.random.seed(123)
# Initialize random_walk
random_walk = [0]
for x in range(100):
    step = random_walk[-1]
    dice = np.random.randint(1,7)
    if dice <= 2:
        # Replace below: use max to make sure step can't go below 0
        step = max(0, step - 1)
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)
    random_walk.append(step)
print(random_walk)
# Initialization
import numpy as np
np.random.seed(123)
random_walk = [0]
for x in range(100):
    step = random_walk[-1]
    dice = np.random.randint(1,7)
    if dice <= 2:
        step = max(0, step - 1)
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)
    random_walk.append(step)
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Plot random_walk
plt.plot(random_walk)
# Show the plot
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy015.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Initialization
import numpy as np
np.random.seed(123)
# Initialize all_walks
all_walks = []
# Simulate random walk 10 times
for i in range(10):
    # Code from before
    random_walk = [0]
    for x in range(100):
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        random_walk.append(step)
    # Append random_walk to all_walks
    all_walks.append(random_walk)
# Print all_walks
print(all_walks)
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(123)
all_walks = []
for i in range(10):
    random_walk = [0]
    for x in range(100):
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        random_walk.append(step)
    all_walks.append(random_walk)
# Convert all_walks to Numpy array: np_aw
np_aw = np.array(all_walks)
# Plot np_aw and show
plt.plot(np_aw)
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy016.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Transpose np_aw: np_aw_t
np_aw_t = np.transpose(np_aw)
# Plot np_aw_t and show
plt.plot(np_aw_t)
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy017.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(123)
all_walks = []
# Simulate random walk 250 times
for i in range(250):
    random_walk = [0]
    for x in range(100):
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        # Implement clumsiness
        if np.random.rand() <= 0.001:
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)
# Create and plot np_aw_t
np_aw_t = np.transpose(np.array(all_walks))
plt.plot(np_aw_t)
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy018.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(123)
all_walks = []
# Simulate random walk 500 times
for i in range(500):
    random_walk = [0]
    for x in range(100):
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        if np.random.rand() <= 0.001:
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)
# Create and plot np_aw_t
np_aw_t = np.transpose(np.array(all_walks))
# Select last row from np_aw_t: ends
ends = np_aw_t[-1]
# Plot histogram of ends, display plot
plt.hist(ends)
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy019.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
## 0.6964691855978616
## 6
## 3
## 6
## 53
## [0, 3, 4, 5, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1, 0, -1, 0, 5, 4, 3, 4, 3, 4, 5, 6, 7, 8, 7, 8, 7, 8, 9, 10, 11, 10, 14, 15, 14, 15, 14, 15, 16, 17, 18, 19, 20, 21, 24, 25, 26, 27, 32, 33, 37, 38, 37, 38, 39, 38, 39, 40, 42, 43, 44, 43, 42, 43, 44, 43, 42, 43, 44, 46, 45, 44, 45, 44, 45, 46, 47, 49, 48, 49, 50, 51, 52, 53, 52, 51, 52, 51, 52, 53, 52, 55, 56, 57, 58, 57, 58, 59]
## [0, 3, 4, 5, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1, 0, 0, 1, 6, 5, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 8, 9, 10, 11, 12, 11, 15, 16, 15, 16, 15, 16, 17, 18, 19, 20, 21, 22, 25, 26, 27, 28, 33, 34, 38, 39, 38, 39, 40, 39, 40, 41, 43, 44, 45, 44, 43, 44, 45, 44, 43, 44, 45, 47, 46, 45, 46, 45, 46, 47, 48, 50, 49, 50, 51, 52, 53, 54, 53, 52, 53, 52, 53, 54, 53, 56, 57, 58, 59, 58, 59, 60]
## [[0, 3, 4, 5, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1, 0, 0, 1, 6, 5, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 8, 9, 10, 11, 12, 11, 15, 16, 15, 16, 15, 16, 17, 18, 19, 20, 21, 22, 25, 26, 27, 28, 33, 34, 38, 39, 38, 39, 40, 39, 40, 41, 43, 44, 45, 44, 43, 44, 45, 44, 43, 44, 45, 47, 46, 45, 46, 45, 46, 47, 48, 50, 49, 50, 51, 52, 53, 54, 53, 52, 53, 52, 53, 54, 53, 56, 57, 58, 59, 58, 59, 60], [0, 4, 3, 2, 4, 3, 4, 6, 7, 8, 13, 12, 13, 14, 15, 16, 17, 16, 21, 22, 23, 24, 23, 22, 21, 20, 19, 20, 21, 22, 28, 27, 26, 25, 26, 27, 28, 27, 28, 29, 28, 33, 34, 33, 32, 31, 30, 31, 30, 29, 31, 32, 35, 36, 38, 39, 40, 41, 40, 39, 40, 41, 42, 43, 42, 43, 44, 45, 48, 49, 50, 49, 50, 49, 50, 51, 52, 56, 55, 54, 55, 56, 57, 56, 57, 56, 57, 59, 64, 63, 64, 65, 66, 67, 68, 69, 68, 69, 70, 71, 73], [0, 2, 1, 2, 3, 6, 5, 6, 5, 6, 7, 8, 7, 8, 7, 8, 9, 11, 10, 9, 10, 11, 10, 12, 13, 14, 15, 16, 17, 18, 17, 18, 19, 24, 25, 24, 23, 22, 21, 22, 23, 24, 29, 30, 29, 30, 31, 32, 33, 34, 35, 34, 33, 34, 33, 39, 38, 39, 38, 39, 38, 39, 43, 47, 49, 51, 50, 51, 53, 52, 58, 59, 61, 62, 61, 62, 63, 64, 63, 64, 65, 66, 68, 67, 66, 67, 73, 78, 77, 76, 80, 81, 82, 83, 85, 84, 85, 84, 85, 84, 83], [0, 6, 5, 6, 7, 8, 9, 10, 11, 12, 13, 12, 13, 12, 11, 12, 11, 12, 11, 12, 13, 17, 18, 17, 23, 22, 21, 22, 21, 20, 21, 20, 24, 23, 24, 23, 24, 23, 24, 26, 25, 24, 23, 24, 23, 28, 29, 30, 29, 28, 29, 28, 29, 28, 33, 34, 33, 32, 31, 30, 31, 32, 36, 42, 43, 44, 45, 46, 45, 46, 48, 49, 50, 51, 50, 49, 50, 49, 50, 51, 52, 51, 52, 53, 54, 53, 52, 53, 54, 59, 60, 61, 66, 65, 66, 65, 66, 67, 68, 69, 68], [0, 6, 5, 6, 5, 4, 5, 9, 10, 11, 12, 13, 12, 11, 10, 9, 8, 9, 10, 11, 12, 13, 14, 13, 14, 15, 14, 15, 16, 19, 18, 19, 18, 19, 22, 23, 24, 25, 24, 23, 26, 27, 28, 29, 28, 27, 28, 31, 32, 37, 38, 37, 38, 37, 38, 37, 43, 42, 41, 42, 44, 43, 42, 41, 42, 43, 44, 45, 49, 54, 55, 56, 57, 60, 61, 62, 63, 64, 65, 66, 65, 64, 65, 66, 65, 71, 70, 71, 72, 71, 70, 71, 70, 69, 75, 74, 73, 74, 75, 74, 73], [0, 0, 0, 1, 7, 8, 11, 12, 18, 19, 20, 26, 
25, 31, 30, 31, 32, 33, 32, 38, 39, 38, 39, 38, 39, 38, 39, 38, 39, 43, 44, 46, 45, 46, 45, 44, 45, 44, 45, 44, 48, 52, 51, 50, 49, 50, 51, 55, 56, 57, 61, 60, 59, 58, 59, 60, 62, 61, 60, 61, 62, 64, 67, 72, 73, 72, 73, 74, 75, 76, 77, 76, 77, 78, 84, 83, 88, 87, 91, 90, 94, 93, 96, 97, 96, 97, 103, 102, 101, 100, 104, 103, 102, 103, 104, 103, 104, 105, 106, 107, 106], [0, 0, 0, 1, 0, 0, 4, 5, 7, 11, 17, 16, 15, 16, 17, 18, 17, 18, 17, 18, 19, 18, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 33, 32, 35, 36, 35, 34, 35, 36, 37, 36, 35, 34, 33, 34, 35, 36, 37, 38, 39, 40, 39, 40, 41, 43, 42, 43, 44, 47, 49, 50, 49, 48, 47, 46, 45, 46, 45, 46, 48, 49, 50, 49, 50, 49, 48, 49, 48, 47, 46, 47, 46, 45, 46, 47, 48, 50, 51, 52, 51, 50, 51, 57, 56, 57, 58, 63, 62, 63], [0, 0, 1, 2, 1, 2, 3, 9, 10, 11, 12, 11, 13, 14, 15, 16, 15, 16, 17, 18, 19, 18, 19, 18, 19, 20, 19, 20, 24, 25, 28, 29, 33, 34, 33, 34, 35, 34, 33, 38, 39, 40, 39, 38, 39, 40, 41, 40, 44, 43, 44, 45, 46, 47, 48, 49, 50, 49, 48, 47, 48, 49, 53, 54, 53, 54, 55, 54, 60, 61, 62, 63, 62, 63, 64, 67, 66, 67, 66, 65, 64, 65, 66, 68, 69, 70, 74, 75, 74, 73, 74, 75, 74, 73, 74, 75, 76, 75, 74, 75, 76], [0, 1, 0, 1, 2, 1, 0, 0, 1, 2, 3, 4, 5, 10, 14, 13, 14, 13, 12, 11, 12, 11, 12, 13, 12, 16, 17, 16, 17, 16, 15, 16, 15, 19, 20, 21, 22, 23, 24, 23, 24, 25, 26, 27, 28, 27, 32, 33, 34, 33, 34, 33, 34, 35, 34, 35, 40, 41, 42, 41, 42, 43, 44, 43, 44, 43, 44, 45, 44, 43, 42, 43, 44, 43, 42, 41, 42, 46, 47, 48, 49, 50, 51, 50, 51, 52, 51, 52, 57, 58, 57, 56, 57, 56, 55, 54, 58, 59, 60, 61, 60], [0, 1, 2, 3, 4, 5, 4, 3, 6, 5, 4, 3, 2, 3, 9, 10, 9, 10, 11, 10, 9, 10, 11, 12, 11, 15, 16, 15, 17, 18, 17, 18, 19, 20, 21, 22, 23, 22, 21, 22, 23, 22, 23, 24, 23, 22, 21, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 33, 34, 35, 36, 37, 38, 37, 36, 42, 43, 44, 43, 42, 41, 45, 46, 50, 49, 55, 56, 57, 61, 62, 61, 60, 61, 62, 63, 64, 63, 69, 70, 69, 73, 74, 73, 74, 73, 79, 85, 86, 85, 86, 87]]
Single random walk:
10 full walks:
10 full walks transposed:
250 random walks with “clumsiness”:
500 random walks with “clumsiness”:
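The repeated walk code above can be wrapped in a function so that the simulation and the summary of end points are kept separate. The sketch below is one possible arrangement, not the course's own solution: `simulate_walk` and the threshold of 60 are hypothetical, and it uses a local `np.random.default_rng` generator instead of `np.random.seed`, so its numbers will not match the output printed in these notes.

```python
import numpy as np


def simulate_walk(rng, n_steps=100, clumsiness=0.001):
    """Return one random walk (n_steps + 1 positions) following the dice rules above."""
    walk = [0]
    for _ in range(n_steps):
        step = walk[-1]
        dice = rng.integers(1, 7)          # integer from 1 to 6 inclusive
        if dice <= 2:
            step = max(0, step - 1)        # cannot go below step 0
        elif dice <= 5:
            step = step + 1
        else:
            step = step + rng.integers(1, 7)
        if rng.random() <= clumsiness:     # rare fall back to the start
            step = 0
        walk.append(int(step))
    return walk


rng = np.random.default_rng(123)           # local generator; seed chosen arbitrarily
walks = [simulate_walk(rng) for _ in range(500)]
ends = np.array([walk[-1] for walk in walks])

# Hypothetical threshold: share of walks that finish at step 60 or higher
threshold = 60
share = np.mean(ends >= threshold)
print(share)
```

Passing the generator in as an argument keeps the function free of global state, which makes individual walks easier to reproduce.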
Chapter 1 - Writing your own functions
User-defined functions - with/without parameters, and with/without returning values:
Multiple parameters and return values:
Bringing it all together - practical examples using Twitter data:
Example code includes:
# Define the function shout
def shout():
    """Print a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word = "congratulations" + "!!!"
    # Print shout_word
    print(shout_word)
# Call shout
shout()
# Define shout with the parameter, word
def shout(word):
    """Print a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word = word + '!!!'
    # Print shout_word
    print(shout_word)
# Call shout with the string 'congratulations'
shout("congratulations")
# Define shout with the parameter, word
def shout(word):
    """Return a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word = word + "!!!"
    # Replace print with return
    return shout_word
# Pass 'congratulations' to shout: yell
yell = shout("congratulations")
# Print yell
print(yell)
# Define shout with parameters word1 and word2
def shout(word1, word2):
    """Concatenate strings with three exclamation marks"""
    # Concatenate word1 with '!!!': shout1
    shout1 = word1 + "!!!"
    # Concatenate word2 with '!!!': shout2
    shout2 = word2 + "!!!"
    # Concatenate shout1 with shout2: new_shout
    new_shout = shout1 + shout2
    # Return new_shout
    return new_shout
# Pass 'congratulations' and 'you' to shout(): yell
yell = shout("congratulations", "you")
# Print yell
print(yell)
# Set up the nums tuple for later access
nums = (3, 4, 6)
# Unpack nums into num1, num2, and num3
num1, num2, num3 = nums
# Construct even_nums
even_nums = (2, num2, num3)
# Define shout_all with parameters word1 and word2
def shout_all(word1, word2):
    # Concatenate word1 with '!!!': shout1
    shout1 = word1 + "!!!"
    # Concatenate word2 with '!!!': shout2
    shout2 = word2 + "!!!"
    # Construct a tuple with shout1 and shout2: shout_words
    shout_words = (shout1, shout2)
    # Return shout_words
    return shout_words
# Pass 'congratulations' and 'you' to shout_all(): yell1, yell2
yell1, yell2 = shout_all("congratulations", "you")
# Print yell1 and yell2
print(yell1)
print(yell2)
# Import pandas
import pandas as pd
# DO NOT HAVE THIS CSV; CAN JUST MAKE A COLUMN WITH A SINGLE WORD FOR THE EXAMPLE
# Import Twitter data as DataFrame: df
df = pd.read_csv("tweets.csv")
# Initialize an empty dictionary: langs_count
langs_count = {}
# Extract column from DataFrame: col
col = df['lang']
# Iterate over lang column in DataFrame
for entry in col:
    # If the language is in langs_count, add 1
    if entry in langs_count.keys():
        langs_count[entry] = langs_count[entry] + 1
    # Else add the language to langs_count, set the value to 1
    else:
        langs_count[entry] = 1
# Print the populated dictionary
print(langs_count)
# Define count_entries()
def count_entries(df, col_name):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    # Initialize an empty dictionary: langs_count
    langs_count = {}
    # Extract column from DataFrame: col
    col = df[col_name]
    # Iterate over lang column in DataFrame
    for entry in col:
        # If the language is in langs_count, add 1
        if entry in langs_count.keys():
            langs_count[entry] = langs_count[entry] + 1
        # Else add the language to langs_count, set the value to 1
        else:
            langs_count[entry] = 1
    # Return the langs_count dictionary
    return langs_count
# NEED TO CREATE tweets_df such that it contains a column 'lang'
# Call count_entries(): result
tweets_df = df
result = count_entries(tweets_df, "lang")
# Print the result
print(result)
## congratulations!!!
## congratulations!!!
## congratulations!!!
## congratulations!!!you!!!
## congratulations!!!
## you!!!
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}
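The if/else tally used in `count_entries()` can be shortened with `dict.get`, or replaced by `collections.Counter` from the standard library. The following is a minimal sketch on a plain list; `count_values` and the `langs` list are hypothetical stand-ins for the function and the 'lang' column above.

```python
from collections import Counter


def count_values(values):
    """Return a dict mapping each distinct value to its number of occurrences."""
    counts = {}
    for entry in values:
        # dict.get returns 0 for unseen keys, replacing the if/else branch
        counts[entry] = counts.get(entry, 0) + 1
    return counts


# Hypothetical stand-in for the 'lang' column of the tweets data
langs = ["en", "fr", "en", "it", "en", "sp", "fr"]

manual = count_values(langs)
with_counter = dict(Counter(langs))   # Counter performs the same tallying
print(manual)
```

Because a pandas column can be iterated like a list, the same function should also accept `df["lang"]` directly.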
Chapter 2 - Default arguments and variable-length arguments
Scope (where are objects or names accessible) and user-defined functions:
Nested functions - one function defined inside another function:
Default and flexible arguments - arguments used when they are not specified, or when a flexible number of arguments can be passed:
Bringing it all together - case study on processing a data frame to get word counts, defaulted to column ‘lang’:
Example code includes:
# Create a string: team
team = "teen titans"
# Define change_team()
def change_team():
    """Change the value of the global variable team."""
    # Use team in global scope
    global team
    # Change the value of team in global: team
    team = "justice league"
# Print team
print(team)
# Call change_team()
change_team()
# Print team
print(team)
# Define three_shouts
def three_shouts(word1, word2, word3):
    """Returns a tuple of strings
    concatenated with '!!!'."""
    # Define inner
    def inner(word):
        """Returns a string concatenated with '!!!'."""
        return word + '!!!'
    # Return a tuple of strings
    return (inner(word1), inner(word2), inner(word3))
# Call three_shouts() and print
print(three_shouts('a', 'b', 'c'))
# Define echo
def echo(n):
    """Return the inner_echo function."""
    # Define inner_echo
    def inner_echo(word1):
        """Concatenate n copies of word1."""
        echo_word = word1 * n
        return echo_word
    # Return inner_echo
    return inner_echo
# Call echo: twice
twice = echo(2)
# Call echo: thrice
thrice = echo(3)
# Call twice() and thrice() then print
print(twice('hello'), thrice('hello'))
# Define echo_shout()
def echo_shout(word):
    """Change the value of a nonlocal variable"""
    # Concatenate word with itself: echo_word
    echo_word = word + word
    # Print echo_word
    print(echo_word)
    # Define inner function shout()
    def shout():
        """Alter a variable in the enclosing scope"""
        # Use echo_word in nonlocal scope
        nonlocal echo_word
        # Change echo_word to echo_word concatenated with '!!!'
        echo_word = echo_word + "!!!"
    # Call function shout()
    shout()
    # Print echo_word
    print(echo_word)
#Call function echo_shout() with argument 'hello'
echo_shout("hello")
# Define shout_echo
def shout_echo(word1, echo=1):
    """Concatenate echo copies of word1 and three
    exclamation marks at the end of the string."""
    # Concatenate echo copies of word1 using *: echo_word
    echo_word = word1 * echo
    # Concatenate '!!!' to echo_word: shout_word
    shout_word = echo_word + '!!!'
    # Return shout_word
    return shout_word
# Call shout_echo() with "Hey": no_echo
no_echo = shout_echo("Hey")
# Call shout_echo() with "Hey" and echo=5: with_echo
with_echo = shout_echo("Hey", 5)
# Print no_echo and with_echo
print(no_echo)
print(with_echo)
# Define shout_echo
def shout_echo(word1, echo=1, intense=False):
    """Concatenate echo copies of word1 and three
    exclamation marks at the end of the string."""
    # Concatenate echo copies of word1 using *: echo_word
    echo_word = word1 * echo
    # Capitalize echo_word if intense is True
    if intense is True:
        # Capitalize and concatenate '!!!': echo_word_new
        echo_word_new = echo_word.upper() + '!!!'
    else:
        # Concatenate '!!!' to echo_word: echo_word_new
        echo_word_new = echo_word + '!!!'
    # Return echo_word_new
    return echo_word_new
# Call shout_echo() with "Hey", echo=5 and intense=True: with_big_echo
with_big_echo = shout_echo("Hey", 5, True)
# Call shout_echo() with "Hey" and intense=True: big_no_echo
big_no_echo = shout_echo("Hey", intense=True)
# Print values
print(with_big_echo)
print(big_no_echo)
# Define gibberish
def gibberish(*args):
    """Concatenate strings in *args together."""
    # Initialize an empty string: hodgepodge
    hodgepodge = ""
    # Concatenate the strings in args
    for word in args:
        hodgepodge += word
    # Return hodgepodge
    return hodgepodge
# Call gibberish() with one string: one_word
one_word = gibberish("luke")
# Call gibberish() with five strings: many_words
many_words = gibberish("luke", "leia", "han", "obi", "darth")
# Print one_word and many_words
print(one_word)
print(many_words)
# Define report_status
def report_status(**kwargs):
    """Print out the status of a movie character."""
    print("\nBEGIN: REPORT\n")
    # Iterate over the key-value pairs of kwargs
    for key, value in kwargs.items():
        # Print out the keys and values, separated by a colon ':'
        print(key + ": " + value)
    print("\nEND REPORT")
# First call to report_status()
report_status(name="luke", affiliation="jedi", status="missing")
# Second call to report_status()
report_status(name="anakin", affiliation="sith lord", status="deceased")
# DO NOT HAVE file tweets_df (may need to create some dummy data . . . )
import pandas as pd
tweets_df = pd.read_csv("tweets.csv")
# Define count_entries()
def count_entries(df, col_name="lang"):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    # Initialize an empty dictionary: cols_count
    cols_count = {}
    # Extract column from DataFrame: col
    col = df[col_name]
    # Iterate over the column in DataFrame
    for entry in col:
        # If entry is in cols_count, add 1
        if entry in cols_count.keys():
            cols_count[entry] += 1
        # Else add the entry to cols_count, set the value to 1
        else:
            cols_count[entry] = 1
    # Return the cols_count dictionary
    return cols_count
# Call count_entries(): result1
result1 = count_entries(tweets_df)
# Call count_entries(): result2
result2 = count_entries(tweets_df, "source")
# Print result1 and result2
print(result1)
print(result2)
# Define count_entries()
def count_entries(df, *args):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    # Initialize an empty dictionary: cols_count
    cols_count = {}
    # Iterate over column names in args
    for col_name in args:
        # Extract column from DataFrame: col
        col = df[col_name]
        # Iterate over the column in DataFrame
        for entry in col:
            # If entry is in cols_count, add 1
            if entry in cols_count.keys():
                cols_count[entry] += 1
            # Else add the entry to cols_count, set the value to 1
            else:
                cols_count[entry] = 1
    # Return the cols_count dictionary
    return cols_count
# Call count_entries(): result1
result1 = count_entries(tweets_df, "lang")
# Call count_entries(): result2
result2 = count_entries(tweets_df, "lang", "source")
# Print result1 and result2
print(result1)
print(result2)
## teen titans
## justice league
## ('a!!!', 'b!!!', 'c!!!')
## hellohello hellohellohello
## hellohello
## hellohello!!!
## Hey!!!
## HeyHeyHeyHeyHey!!!
## HEYHEYHEYHEYHEY!!!
## HEY!!!
## luke
## lukeleiahanobidarth
##
## BEGIN: REPORT
##
## name: luke
## affiliation: jedi
## status: missing
##
## END REPORT
##
## BEGIN: REPORT
##
## name: anakin
## affiliation: sith lord
## status: deceased
##
## END REPORT
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}
## {'C': 60, 'A': 57, 'D': 35, 'B': 48}
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18, 'C': 60, 'A': 57, 'D': 35, 'B': 48}
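Default arguments, `*args`, and `**kwargs` can appear in the same signature, in that order after any required parameters. The sketch below illustrates this with a hypothetical `build_report` function that is not part of the course material.

```python
def build_report(title, *args, sep=", ", **kwargs):
    """Combine a required argument, extra positionals, a default, and keyword extras."""
    # args is a tuple holding the extra positional arguments
    parts = [title + ": " + sep.join(args)]
    # kwargs is a dict holding the extra keyword arguments
    for key, value in kwargs.items():
        parts.append(key + "=" + str(value))
    return " | ".join(parts)


report = build_report("crew", "luke", "leia", "han", status="missing")
print(report)

# sep comes after *args, so it can only be passed by keyword
custom = build_report("crew", "luke", "leia", sep="/")
print(custom)
```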
Chapter 3 - Lambda functions and error handling
Lambda functions - quicker way to write functions on the fly:
Introduction to error handling - functions generally raise an error if something is wrong, though that error can be caught and handled:
Bringing it all together:
Example code includes:
# Define echo_word as a lambda function: echo_word
echo_word = (lambda word1, echo : word1 * echo)
# Call echo_word: result
result = echo_word("hey", 5)
# Print result
print(result)
# Create a list of strings: spells
spells = ["protego", "accio", "expecto patronum", "legilimens"]
# Use map() to apply a lambda function over spells: shout_spells
shout_spells = map(lambda a : a + "!!!", spells)
# Convert shout_spells to a list: shout_spells_list
shout_spells_list = list(shout_spells)
# Convert shout_spells into a list and print it
print(shout_spells_list)
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# Use filter() to apply a lambda function over fellowship: result
result = filter(lambda a : len(a) > 6, fellowship)
# Convert result to a list: result_list
result_list = list(result)
# Convert result into a list and print it
print(result_list)
# Import reduce from functools
from functools import reduce
# Create a list of strings: stark
stark = ['robb', 'sansa', 'arya', 'eddard', 'jon']
# Use reduce() to apply a lambda function over stark: result
result = reduce(lambda item1, item2 : item1 + item2, stark)
# Print the result
print(result)
# Define shout_echo
def shout_echo(word1, echo=1):
    """Concatenate echo copies of word1 and three
    exclamation marks at the end of the string."""
    # Initialize empty strings: echo_word, shout_words
    echo_word = ""
    shout_words = ""
    # Add exception handling with try-except
    try:
        # Concatenate echo copies of word1 using *: echo_word
        echo_word = word1 * echo
        # Concatenate '!!!' to echo_word: shout_words
        shout_words = echo_word + "!!!"
    except:
        # Print error message
        print("word1 must be a string and echo must be an integer.")
    # Return shout_words
    return shout_words
# Call shout_echo
shout_echo("particle", echo="accelerator")
# Define shout_echo
def shout_echo(word1, echo=1):
    """Concatenate echo copies of word1 and three
    exclamation marks at the end of the string."""
    # Raise an error with raise
    if echo < 0:
        raise ValueError('echo must be greater than or equal to 0')
    # Concatenate echo copies of word1 using *: echo_word
    echo_word = word1 * echo
    # Concatenate '!!!' to echo_word: shout_word
    shout_word = echo_word + '!!!'
    # Return shout_word
    return shout_word
# Call shout_echo
shout_echo("particle", echo=5)
# DO NOT HAVE file tweets_df (made "tweets.csv" using R)
import pandas as pd
tweets_df = pd.read_csv("tweets.csv")
# Select retweets from the Twitter DataFrame: result
result = filter(lambda x : x[0:2] == "RT", tweets_df["text"])
# Create list from filter object result: res_list
res_list = list(result)
# Print all retweets in res_list
for tweet in res_list:
    print(tweet)
# Define count_entries()
def count_entries(df, col_name='lang'):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    # Initialize an empty dictionary: cols_count
    cols_count = {}
    # Add try block
    try:
        # Extract column from DataFrame: col
        col = df[col_name]
        # Iterate over the column in dataframe
        for entry in col:
            # If entry is in cols_count, add 1
            if entry in cols_count.keys():
                cols_count[entry] += 1
            # Else add the entry to cols_count, set the value to 1
            else:
                cols_count[entry] = 1
        # Return the cols_count dictionary
        return cols_count
    # Add except block
    except:
        print('The DataFrame does not have a ' + col_name + ' column.')
# DO NOT HAVE file tweets_df
# Call count_entries(): result1
result1 = count_entries(tweets_df, 'lang')
# Print result1
print(result1)
# Call count_entries(): result2
result2 = count_entries(tweets_df, 'lang1')
# Define count_entries()
def count_entries(df, col_name='lang'):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    # Raise a ValueError if col_name is NOT in DataFrame
    if col_name not in df.columns:
        raise ValueError('The DataFrame does not have a ' + col_name + ' column.')
    # Initialize an empty dictionary: cols_count
    cols_count = {}
    # Extract column from DataFrame: col
    col = df[col_name]
    # Iterate over the column in DataFrame
    for entry in col:
        # If entry is in cols_count, add 1
        if entry in cols_count:
            cols_count[entry] += 1
        # Else add the entry to cols_count, set the value to 1
        else:
            cols_count[entry] = 1
    # Return the cols_count dictionary
    return cols_count
# Call count_entries(): result1
result1 = count_entries(tweets_df, "lang")
# Print result1
print(result1)
# CAREFUL, THIS ONE IS DESIGNED TO RAISE THE ERROR!
# count_entries(tweets_df, 'lang1')
## heyheyheyheyhey
## ['protego!!!', 'accio!!!', 'expecto patronum!!!', 'legilimens!!!']
## ['samwise', 'aragorn', 'legolas', 'boromir']
## robbsansaaryaeddardjon
## word1 must be a string and echo must be an integer.
## RT H
## RT F
## RT H
## RT H
## RT G
## RT G
## RT G
## RT E
## RT E
## RT E
## RT H
## RT E
## RT F
## RT G
## RT G
## RT E
## RT G
## RT E
## RT E
## RT H
## RT G
## RT E
## RT G
## RT G
## RT F
## RT H
## RT H
## RT E
## RT H
## RT G
## RT F
## RT F
## RT F
## RT H
## RT G
## RT E
## RT E
## RT H
## RT F
## RT G
## RT H
## RT E
## RT H
## RT G
## RT F
## RT E
## RT F
## RT E
## RT E
## RT H
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}
## The DataFrame does not have a lang1 column.
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}
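The two error-handling styles used above for shout_echo can be compared side by side. The following is a minimal sketch rather than the course's solution: the names shout_echo_safe and shout_echo_strict are hypothetical, and catching TypeError specifically (instead of a bare except) is an assumption about which error is expected.

```python
def shout_echo_safe(word1, echo=1):
    """Return word1 repeated echo times plus '!!!', or '' on bad input."""
    try:
        return word1 * echo + "!!!"
    except TypeError:
        # Raised when echo is not an integer (e.g. str * str)
        print("word1 must be a string and echo must be an integer.")
        return ""


def shout_echo_strict(word1, echo=1):
    """Same result, but reject negative echo values up front."""
    if echo < 0:
        raise ValueError("echo must be greater than or equal to 0")
    return word1 * echo + "!!!"


# The error is handled inside the function; an empty string comes back
print(shout_echo_safe("particle", echo="accelerator"))
print(shout_echo_strict("particle", echo=2))

try:
    shout_echo_strict("particle", echo=-1)
except ValueError as err:
    # The caller decides how to respond to the raised error
    print("Caught:", err)
```

Handling the error inside the function hides the problem from the caller, while raising it leaves the decision to the calling code.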
Chapter 1 - Using iterators in PythonLand
Introduction to iterators - for loops and the like:
Playing with iterators - enumerate and zip:
Using iterators to load large files into memory - loading data in chunks:
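Chunked loading can be sketched without pandas or the tweets file. The following is a minimal, standard-library illustration under stated assumptions: the in-memory text stands in for a large file, and read_in_chunks is a hypothetical helper, not a pandas function.

```python
import io
from collections import Counter
from itertools import islice


def read_in_chunks(file_object, chunk_size):
    """Yield lists of at most chunk_size stripped lines from file_object."""
    while True:
        chunk = [line.strip() for line in islice(file_object, chunk_size)]
        if not chunk:
            break
        yield chunk


# Made-up data standing in for the 'lang' column of a large file
fake_file = io.StringIO("en\nen\nfr\nen\nit\nsp\nen\n")

counts = Counter()
for chunk in read_in_chunks(fake_file, chunk_size=3):
    # Only one chunk is held in memory at a time
    counts.update(chunk)

print(dict(counts))
```

The running Counter plays the same role as counts_dict in the pandas version: it accumulates results across chunks so the full data set never has to be in memory at once.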
Example code includes:
# Create a list of strings: flash
flash = ['jay garrick', 'barry allen', 'wally west', 'bart allen']
# Print each list item in flash using a for loop
for person in flash:
    print(person)
# Create an iterator for flash: superspeed
superspeed = iter(flash)
# Print each item from the iterator
print(next(superspeed))
print(next(superspeed))
print(next(superspeed))
print(next(superspeed))
# Create an iterator for range(3): small_value
small_value = iter(range(3))
# Print the values in small_value
print(next(small_value))
print(next(small_value))
print(next(small_value))
# Loop over range(3) and print the values
for num in range(3):
    print(num)
# Create an iterator for range(10 ** 100): googol
googol = iter(range(10 ** 100))
# Print the first 5 values from googol
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))
# Create a range object: values
values = range(10, 21)
# Print the range object
print(values)
# Create a list of integers: values_list
values_list = list(values)
# Print values_list
print(values_list)
# Get the sum of values: values_sum
values_sum = sum(values)
# Print values_sum
print(values_sum)
# Create a list of strings: mutants
mutants = ['charles xavier',
'bobby drake',
'kurt wagner',
'max eisenhardt',
'kitty pride']
# Create a list of tuples: mutant_list
mutant_list = list(enumerate(mutants))
# Print the list of tuples
print(mutant_list)
# Unpack and print the tuple pairs
for index1, value1 in mutant_list:
    print(index1, value1)
# Change the start index
for index2, value2 in enumerate(mutants, start=1):
    print(index2, value2)
aliases = ['prof x', 'iceman', 'nightcrawler', 'magneto', 'shadowcat']
powers = ['telepathy', 'thermokinesis', 'teleportation', 'magnetokinesis', 'intangibility' ]
# Create a list of tuples: mutant_data
mutant_data = list(zip(mutants, aliases, powers))
# Print the list of tuples
print(mutant_data)
# Create a zip object using the three lists: mutant_zip
mutant_zip = zip(mutants, aliases, powers)
# Print the zip object
print(mutant_zip)
# Unpack the zip object and print the tuple values
for value1, value2, value3 in mutant_zip:
    print(value1, value2, value3)
# Create a zip object from mutants and powers: z1
z1 = zip(mutants, powers)
# Print the tuples in z1 by unpacking with *
print(*z1)
# Re-create a zip object from mutants and powers: z1
z1 = zip(mutants, powers)
# 'Unzip' the tuples in z1 by unpacking with * and zip(): result1, result2
result1, result2 = zip(*z1)
# Check if unpacked tuples are equivalent to original tuples
print(result1 == tuple(mutants))
print(result2 == tuple(powers))
import pandas as pd
# Initialize an empty dictionary: counts_dict
counts_dict = dict()
# DO NOT HAVE FILE tweets.csv
# Created in R - see above for code
# Iterate over the file chunk by chunk
for chunk in pd.read_csv("tweets.csv", chunksize=10):
    # Iterate over the column in DataFrame
    for entry in chunk['lang']:
        if entry in counts_dict:
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1
# Print the populated dictionary
print(counts_dict)
# Define count_entries()
def count_entries(csv_file, c_size, colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}
    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize=c_size):
        # Iterate over the column in DataFrame
        for entry in chunk[colname]:
            if entry in counts_dict:
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1
    # Return counts_dict
    return counts_dict
# Call count_entries(): result_counts
result_counts = count_entries("tweets.csv", 10, "lang")
# Print result_counts
print(result_counts)
## jay garrick
## barry allen
## wally west
## bart allen
## jay garrick
## barry allen
## wally west
## bart allen
## 0
## 1
## 2
## 0
## 1
## 2
## 0
## 1
## 2
## 3
## 4
## range(10, 21)
## [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
## 165
## [(0, 'charles xavier'), (1, 'bobby drake'), (2, 'kurt wagner'), (3, 'max eisenhardt'), (4, 'kitty pride')]
## 0 charles xavier
## 1 bobby drake
## 2 kurt wagner
## 3 max eisenhardt
## 4 kitty pride
## 1 charles xavier
## 2 bobby drake
## 3 kurt wagner
## 4 max eisenhardt
## 5 kitty pride
## [('charles xavier', 'prof x', 'telepathy'), ('bobby drake', 'iceman', 'thermokinesis'), ('kurt wagner', 'nightcrawler', 'teleportation'), ('max eisenhardt', 'magneto', 'magnetokinesis'), ('kitty pride', 'shadowcat', 'intangibility')]
## <zip object at 0x005D20D0>
## charles xavier prof x telepathy
## bobby drake iceman thermokinesis
## kurt wagner nightcrawler teleportation
## max eisenhardt magneto magnetokinesis
## kitty pride shadowcat intangibility
## ('charles xavier', 'telepathy') ('bobby drake', 'thermokinesis') ('kurt wagner', 'teleportation') ('max eisenhardt', 'magnetokinesis') ('kitty pride', 'intangibility')
## True
## True
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}
## {'en': 159, 'fr': 10, 'it': 13, 'sp': 18}
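The reason z1 is re-created in the code above is that a zip object is an iterator and can be consumed only once. A minimal sketch of that behavior, using shortened versions of the lists above:

```python
mutants = ['charles xavier', 'bobby drake', 'kurt wagner']
powers = ['telepathy', 'thermokinesis', 'teleportation']

z1 = zip(mutants, powers)
first_pass = list(z1)    # consumes every tuple
second_pass = list(z1)   # nothing left to yield

print(first_pass)
print(second_pass)

# next() on an exhausted iterator raises StopIteration;
# passing a default as the second argument avoids the exception
print(next(z1, 'exhausted'))

# A fresh zip object is needed to 'unzip' with *
names, abilities = zip(*zip(mutants, powers))
print(names == tuple(mutants), abilities == tuple(powers))
```

The same applies to the objects returned by enumerate() and iter(): once their values have been consumed, a new object has to be created to iterate again.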
Chapter 2 - List comprehensions and generators
List comprehensions help address some of the inefficiencies (coding, run time, etc.) of using for loops for some tasks:
Advanced comprehensions - additional functionality available:
Introduction to generator expressions - creating generator objects rather than list/dictionaries:
Wrapping up comprehensions and generators - helps with wrangling data:
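As a rough illustration of the loop-versus-comprehension point, the following sketch builds the same results both ways. The nums list and all variable names are made up for the example.

```python
nums = [12, 8, 21, 3, 16]

# for-loop version: initialize, loop, append
new_nums_loop = []
for num in nums:
    new_nums_loop.append(num + 1)

# List comprehension version: one expression
new_nums = [num + 1 for num in nums]
print(new_nums == new_nums_loop)

# Nested loops collapse into one comprehension with two for clauses
pairs_loop = []
for a in range(2):
    for b in range(6, 8):
        pairs_loop.append((a, b))
pairs = [(a, b) for a in range(2) for b in range(6, 8)]
print(pairs == pairs_loop)

# Condition on the iterable (filters items) vs. on the output expression (if/else)
evens = [n for n in nums if n % 2 == 0]
labels = ['even' if n % 2 == 0 else 'odd' for n in nums]
print(evens)
print(labels)
```

A trailing if drops items from the result, whereas an if/else in the output expression keeps every item and only changes its value.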
Example code includes:
doctor = ['house', 'cuddy', 'chase', 'thirteen', 'wilson']
[doc[0] for doc in doctor]
# Create list comprehension: squares
squares = [i ** 2 for i in range(0, 10)]
# Create a 5 x 5 matrix using a list of lists: matrix
matrix = [[col for col in range(5)] for row in range(5)]
# Print the matrix
for row in matrix:
    print(row)
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# Create list comprehension: new_fellowship
new_fellowship = [member for member in fellowship if len(member) >= 7]
# Print the new list
print(new_fellowship)
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# Create list comprehension: new_fellowship
new_fellowship = [member if len(member) >= 7 else "" for member in fellowship]
# Print the new list
print(new_fellowship)
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# Create dict comprehension: new_fellowship
new_fellowship = {member : len(member) for member in fellowship}
# Print the new dictionary
print(new_fellowship)
# Create generator object: result
result = (num for num in range(16))
# Print the first 5 values
print(next(result))
print(next(result))
print(next(result))
print(next(result))
print(next(result))
# Print the rest of the values
# NOTE - this prints only 5-15, since 0-4 were already "consumed" above
for value in result:
    print(value)
# Create a list of strings: lannister
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']
# Create a generator object: lengths
lengths = (len(person) for person in lannister)
# Iterate over and print the values in lengths
for value in lengths:
    print(value)
# Create a list of strings
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']
# Define generator function get_lengths
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""
    # Yield the length of a string
    for person in input_list:
        yield len(person)
# Print the values generated by get_lengths()
for value in get_lengths(lannister):
    print(value)
# DO NOT HAVE pandas DataFrame "df"
# Extract the created_at column from df: tweet_time
# tweet_time = df["created_at"]
# Extract the clock time: tweet_clock_time
# tweet_clock_time = [entry[11:19] for entry in tweet_time]
# Print the extracted times
# print(tweet_clock_time)
# Extract the created_at column from df: tweet_time
# tweet_time = df['created_at']
# Extract the clock time: tweet_clock_time
# tweet_clock_time = [entry[11:19] for entry in tweet_time if entry[17:19] == "19"]
# Print the extracted times
# print(tweet_clock_time)
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
## [0, 1, 2, 3, 4]
## ['samwise', 'aragorn', 'legolas', 'boromir']
## ['', 'samwise', '', 'aragorn', 'legolas', 'boromir', '']
## {'frodo': 5, 'samwise': 7, 'merry': 5, 'aragorn': 7, 'legolas': 7, 'boromir': 7, 'gimli': 5}
## 0
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 6
## 5
## 5
## 6
## 7
## 6
## 5
## 5
## 6
## 7
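The practical difference between a list comprehension and a generator can be sketched as follows. This is an illustrative example only; even_numbers is a hypothetical generator function, and the range size is arbitrary.

```python
import sys

# A list comprehension builds every element immediately
squares_list = [n ** 2 for n in range(100000)]
# A generator expression produces elements only on request
squares_gen = (n ** 2 for n in range(100000))

# The generator object stays small regardless of the range size
print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))

# Both can feed functions that accept any iterable
print(sum(squares_gen) == sum(squares_list))


def even_numbers(limit):
    """Generator function: yield even numbers below limit, one at a time."""
    for n in range(limit):
        if n % 2 == 0:
            yield n


print(list(even_numbers(10)))
```

Note that squares_gen is exhausted after the call to sum(), just as result in the example code yields only 5-15 after the first five values have been taken with next().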
Chapter 3 - Bringing it all together (case study)
Welcome to the case study - applying techniques from the previous two courses:
Using Python generators for streaming data:
Reading files in chunks with pandas.read_csv():
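One caveat when streaming a CSV file line by line: splitting on ',' mis-parses quoted fields that themselves contain commas, which is the likely reason keys such as '"Bahamas' appear in the printed counts in this chapter. A minimal sketch using the standard-library csv module follows; the sample rows and their values are made up for illustration, and read_rows is a hypothetical helper.

```python
import csv
import io


def read_rows(file_object):
    """Generator that lazily yields parsed CSV rows from file_object."""
    for row in csv.reader(file_object):
        yield row


# Made-up sample standing in for a large file; the values are placeholders
sample = io.StringIO(
    'Country Name,Country Code,year,value\n'
    '"Bahamas, The",BHS,1960,1.0\n'
    'Arab World,ARB,1960,2.0\n'
)

rows = read_rows(sample)
header = next(rows)  # skip the column names

counts_dict = {}
for row in rows:
    first_col = row[0]
    counts_dict[first_col] = counts_dict.get(first_col, 0) + 1

print(counts_dict)

# Naive splitting breaks the quoted field apart
naive_first_col = '"Bahamas, The",BHS,1960,1.0'.split(',')[0]
print(naive_first_col)
```

csv.reader reads from the file object lazily, so this keeps the streaming behavior of the generator approach while respecting quoted fields.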
Example code includes:
row_vals = [ 'Arab World', 'ARB', 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'SP.ADO.TFRT', '1960', '133.56090740552298' ]
feature_names = [ 'CountryName', 'CountryCode', 'IndicatorName', 'IndicatorCode', 'Year', 'Value' ]
# Zip lists: zipped_lists
zipped_lists = zip(feature_names, row_vals)
# Create a dictionary: rs_dict
rs_dict = dict(zipped_lists)
# Print the dictionary
print(rs_dict)
# Define lists2dict()
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides
    the keys and list2 provides the values."""
    # Zip lists: zipped_lists
    zipped_lists = zip(list1, list2)
    # Create a dictionary: rs_dict
    rs_dict = dict(zipped_lists)
    # Return the dictionary
    return rs_dict
# Call lists2dict: rs_fxn
rs_fxn = lists2dict(feature_names, row_vals)
# Print rs_fxn
print(rs_fxn)
# Create list row_lists
regn = ['Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World', 'Arab World']
abb = ['ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB', 'ARB']
indName = ['Adolescent fertility rate (births per 1,000 women ages 15-19)', 'Age dependency ratio (% of working-age population)', 'Age dependency ratio, old (% of working-age population)', 'Age dependency ratio, young (% of working-age population)', 'Arms exports (SIPRI trend indicator values)', 'Arms imports (SIPRI trend indicator values)', 'Birth rate, crude (per 1,000 people)', 'CO2 emissions (kt)', 'CO2 emissions (metric tons per capita)', 'CO2 emissions from gaseous fuel consumption (% of total)', 'CO2 emissions from liquid fuel consumption (% of total)', 'CO2 emissions from liquid fuel consumption (kt)', 'CO2 emissions from solid fuel consumption (% of total)', 'Death rate, crude (per 1,000 people)', 'Fertility rate, total (births per woman)', 'Fixed telephone subscriptions', 'Fixed telephone subscriptions (per 100 people)', 'Hospital beds (per 1,000 people)', 'International migrant stock (% of population)', 'International migrant stock, total' ]
indCode = ['SP.ADO.TFRT', 'SP.POP.DPND', 'SP.POP.DPND.OL', 'SP.POP.DPND.YG', 'MS.MIL.XPRT.KD', 'MS.MIL.MPRT.KD', 'SP.DYN.CBRT.IN', 'EN.ATM.CO2E.KT', 'EN.ATM.CO2E.PC', 'EN.ATM.CO2E.GF.ZS', 'EN.ATM.CO2E.LF.ZS', 'EN.ATM.CO2E.LF.KT', 'EN.ATM.CO2E.SF.ZS', 'SP.DYN.CDRT.IN', 'SP.DYN.TFRT.IN', 'IT.MLT.MAIN', 'IT.MLT.MAIN.P2', 'SH.MED.BEDS.ZS', 'SM.POP.TOTL.ZS', 'SM.POP.TOTL']
year = ['1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960', '1960']
value = ['133.56090740552298', '87.7976011532547', '6.634579191565161', '81.02332950839141', '3000000.0', '538000000.0', '47.697888095096395', '59563.9892169935', '0.6439635478877049', '5.041291753975099', '84.8514729446567', '49541.707291032304', '4.72698138789597', '19.7544519237187', '6.92402738655897', '406833.0', '0.6167005703199', '1.9296220724398703', '2.9906371279862403', '3324685.0']
row_lists = list(zip(regn, abb, indName, indCode, year, value))
# Print the first two lists in row_lists
print(row_lists[0])
print(row_lists[1])
# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]
# Print the first two dictionaries in list_of_dicts
print(list_of_dicts[0])
print(list_of_dicts[1])
# Import the pandas package
import pandas as pd
# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]
# Turn list of dicts into a DataFrame: df
df = pd.DataFrame(list_of_dicts)
# Print the head of the DataFrame
print(df.head())
# REFERENCE DATA POSSIBLY AT http://data.worldbank.org/data-catalog/world-development-indicators
# Created relevant file "world_dev_ind.csv" using Python and World Bank download
# Open a connection to the file
with open("world_dev_ind.csv") as file:
    # Skip the column names
    file.readline()
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}
    # Process only the first 1000 rows
    for j in range(1000):
        # Split the current line into a list: line
        line = file.readline().split(',')
        # Get the value for the first column: first_col
        first_col = line[0]
        # If the column value is in the dict, increment its value
        if first_col in counts_dict:
            counts_dict[first_col] += 1
        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1
# Print the resulting dictionary
print(counts_dict)
# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""
    # Loop indefinitely until the end of the file
    while True:
        # Read a line from the file: data
        data = file_object.readline()
        # Break if this is the end of the file
        if not data:
            break
        # Yield the line of data
        yield data
# Open a connection to the file
with open('world_dev_ind.csv') as file:
    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)
    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))
# Initialize an empty dictionary: counts_dict
counts_dict = {}
# Open a connection to the file
with open("world_dev_ind.csv") as file:
    # Iterate over the generator from read_large_file()
    for line in read_large_file(file):
        row = line.split(',')
        first_col = row[0]
        if first_col in counts_dict:
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1
# Print the resulting dictionary
print(counts_dict)
# DO NOT HAVE FILE ind_pop.csv (CountryName,CountryCode,IndicatorName,IndicatorCode,Year,Value\n)
# Value for regions of CountryName/CountryCode - fixing Urban population (% of total), SP.URB.TOTL.IN.ZS , 1960
# Just changed it to use "world_dev_ind.csv"
# Import the pandas package
import pandas as pd
import matplotlib.pyplot as plt
# Initialize reader object: df_reader
df_reader = pd.read_csv("world_dev_ind.csv", chunksize=10)
# Print two chunks
print(next(df_reader))
print(next(df_reader))
# DO NOT HAVE FILE ind_pop_data.csv
# ('CountryName,CountryCode,Year,Total Population,Urban population (% of total)\n)
# Appears to be 1960-1964
# Initialize reader object: urb_pop_reader
# Create file using Python, needs to read in using encoding="latin-1"
urb_pop_reader = pd.read_csv("ind_pop_data.csv", chunksize=2500, encoding="latin-1")
# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)
# Check out the head of the DataFrame
print(df_urb_pop.head())
# Check out specific country: df_pop_ceb
idxCeb = df_urb_pop[df_urb_pop["CountryCode"] == "CEB"].index
df_pop_ceb = df_urb_pop.loc[idxCeb, :] # Make sure it is not just a reference . . .
# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb["Total Population"], df_pop_ceb["Urban population (% of total)"])
# Turn zip object into list: pops_list
pops_list = list(pops)
# Print pops_list
print(pops_list)
# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv("ind_pop_data.csv", chunksize=2500, encoding="latin-1")
# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)
# Check out specific country: df_pop_ceb
idxCeb = df_urb_pop[df_urb_pop["CountryCode"] == "CEB"].index
df_pop_ceb = df_urb_pop.loc[idxCeb, :] # Make sure it is not just a reference . . .
# df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']
# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Total Population'],
df_pop_ceb['Urban population (% of total)'])
# Turn zip object into list: pops_list
pops_list = list(pops)
# Use list comprehension to create new DataFrame column 'Total Urban Population'
# df_pop_ceb["Total Urban Population"] = df_pop_ceb["Total Population"]
# a = [int(0.01 * tup[0] * tup[1]) for tup in pops_list]
df_pop_ceb['Total Urban Population'] = [int(0.01 * tup[0] * tup[1]) for tup in pops_list]
# Plot urban population data
df_pop_ceb.plot(kind="scatter", x="Year", y="Total Urban Population")
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy020.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000, encoding="latin-1")
# Initialize empty DataFrame: data
data = pd.DataFrame()
# Iterate over each DataFrame chunk
for df_urb_pop in urb_pop_reader:
    # Check out specific country: df_pop_ceb
    idxCeb = df_urb_pop[df_urb_pop["CountryCode"] == "CEB"].index
    df_pop_ceb = df_urb_pop.loc[idxCeb, :]  # Make sure it is not just a reference . . .
    # Zip DataFrame columns of interest: pops
    pops = zip(df_pop_ceb['Total Population'],
               df_pop_ceb['Urban population (% of total)'])
    # Turn zip object into list: pops_list
    pops_list = list(pops)
    # Use list comprehension to create new DataFrame column 'Total Urban Population'
    # df_pop_ceb["Total Urban Population"] = df_pop_ceb["Total Population"]
    # a = [int(0.01 * tup[0] * tup[1]) for tup in pops_list]
    df_pop_ceb['Total Urban Population'] = [int(0.01 * tup[0] * tup[1]) for tup in pops_list]
    # Append DataFrame chunk to data (DataFrame.append was removed in pandas 2.0)
    data = pd.concat([data, df_pop_ceb])
# Plot urban population data
data.plot(kind='scatter', x='Year', y='Total Urban Population')
# plt.show()
# Save as dummy PNG instead
plt.savefig("_dummyPy021.png", bbox_inches="tight")
plt.clf() # Required to prevent continued over-plotting
# Define plot_pop()
def plot_pop(filename, country_code, pngCode=False):
    # Initialize reader object: urb_pop_reader
    urb_pop_reader = pd.read_csv(filename, chunksize=1000, encoding="latin-1")
    # Initialize empty DataFrame: data
    data = pd.DataFrame()
    # Iterate over each DataFrame chunk
    for df_urb_pop in urb_pop_reader:
        # Check out specific country: df_pop_ceb
        idxCeb = df_urb_pop[df_urb_pop["CountryCode"] == country_code].index
        df_pop_ceb = df_urb_pop.loc[idxCeb, :]  # Make sure it is not just a reference . . .
        # Zip DataFrame columns of interest: pops
        pops = zip(df_pop_ceb['Total Population'],
                   df_pop_ceb['Urban population (% of total)'])
        # Turn zip object into list: pops_list
        pops_list = list(pops)
        # Use list comprehension to create new DataFrame column 'Total Urban Population'
        # df_pop_ceb["Total Urban Population"] = df_pop_ceb["Total Population"]
        # a = [int(0.01 * tup[0] * tup[1]) for tup in pops_list]
        # df_pop_ceb.loc[df_pop_ceb.index, 'Total Urban Population'] = a
        df_pop_ceb['Total Urban Population'] = [int(0.01 * tup[0] * tup[1]) for tup in pops_list]
        # Append DataFrame chunk to data (DataFrame.append was removed in pandas 2.0)
        data = pd.concat([data, df_pop_ceb])
    # Plot urban population data
    data.plot(kind='scatter', x='Year', y='Total Urban Population')
    if not pngCode:
        plt.show()  # Plot by default
    else:
        plt.savefig(pngCode, bbox_inches="tight")  # Save as dummy PNG instead
    plt.clf()  # Required to prevent continued over-plotting
# Set the filename: fn
fn = 'ind_pop_data.csv'
# Call plot_pop for country code 'CEB'
plot_pop(fn, "CEB", "_dummyPy022.png")
# Call plot_pop for country code 'ARB'
plot_pop(fn, "ARB", "_dummyPy023.png")
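The 'Total Urban Population' column above is computed as total population times the urban percentage, divided by 100. A minimal sketch of that list comprehension on plain tuples follows; the figures are made up for illustration and are not taken from the World Bank file.

```python
# (total population, urban population as % of total): made-up values
pops_list = [(1000000, 50.0), (2000000, 25.5)]

# Same expression as in the example code: int(0.01 * total * percent)
urban = [int(0.01 * tup[0] * tup[1]) for tup in pops_list]
print(urban)

# Tuple unpacking in the comprehension expresses the same calculation more readably
urban_unpacked = [int(0.01 * total * pct) for total, pct in pops_list]
print(urban == urban_unpacked)
```

Because int() truncates toward zero, the result is a whole number of people; round() could be used instead if nearest-integer behavior is preferred.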
## {'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}
## {'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}
## ('Arab World', 'ARB', 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'SP.ADO.TFRT', '1960', '133.56090740552298')
## ('Arab World', 'ARB', 'Age dependency ratio (% of working-age population)', 'SP.POP.DPND', '1960', '87.7976011532547')
## {'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}
## {'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Age dependency ratio (% of working-age population)', 'IndicatorCode': 'SP.POP.DPND', 'Year': '1960', 'Value': '87.7976011532547'}
## CountryCode CountryName IndicatorCode \
## 0 ARB Arab World SP.ADO.TFRT
## 1 ARB Arab World SP.POP.DPND
## 2 ARB Arab World SP.POP.DPND.OL
## 3 ARB Arab World SP.POP.DPND.YG
## 4 ARB Arab World MS.MIL.XPRT.KD
##
## IndicatorName Value Year
## 0 Adolescent fertility rate (births per 1,000 wo... 133.56090740552298 1960
## 1 Age dependency ratio (% of working-age populat... 87.7976011532547 1960
## 2 Age dependency ratio, old (% of working-age po... 6.634579191565161 1960
## 3 Age dependency ratio, young (% of working-age ... 81.02332950839141 1960
## 4 Arms exports (SIPRI trend indicator values) 3000000.0 1960
## {'Arab World': 6, 'Caribbean small states': 6, 'Central Europe and the Baltics': 6, 'Early-demographic dividend': 6, 'East Asia & Pacific': 6, 'East Asia & Pacific (excluding high income)': 6, 'East Asia & Pacific (IDA & IBRD countries)': 6, 'Euro area': 6, 'Europe & Central Asia': 6, 'Europe & Central Asia (excluding high income)': 6, 'Europe & Central Asia (IDA & IBRD countries)': 6, 'European Union': 6, 'Fragile and conflict affected situations': 6, 'Heavily indebted poor countries (HIPC)': 6, 'High income': 6, 'IBRD only': 6, 'IDA & IBRD total': 6, 'IDA blend': 6, 'IDA only': 6, 'IDA total': 6, 'Late-demographic dividend': 6, 'Latin America & Caribbean': 6, 'Latin America & Caribbean (excluding high income)': 6, 'Latin America & the Caribbean (IDA & IBRD countries)': 6, 'Least developed countries: UN classification': 6, 'Low & middle income': 6, 'Low income': 6, 'Lower middle income': 6, 'Middle East & North Africa': 6, 'Middle East & North Africa (excluding high income)': 6, 'Middle East & North Africa (IDA & IBRD countries)': 6, 'Middle income': 6, 'North America': 6, 'OECD members': 6, 'Other small states': 6, 'Pacific island small states': 6, 'Post-demographic dividend': 6, 'Pre-demographic dividend': 6, 'Small states': 6, 'South Asia': 6, 'South Asia (IDA & IBRD)': 6, 'Sub-Saharan Africa': 6, 'Sub-Saharan Africa (excluding high income)': 6, 'Sub-Saharan Africa (IDA & IBRD countries)': 6, 'Upper middle income': 6, 'World': 6, 'Afghanistan': 6, 'Albania': 6, 'Algeria': 6, 'American Samoa': 6, 'Andorra': 6, 'Angola': 6, 'Antigua and Barbuda': 6, 'Argentina': 6, 'Armenia': 6, 'Aruba': 6, 'Australia': 6, 'Austria': 6, 'Azerbaijan': 6, '"Bahamas': 6, 'Bahrain': 6, 'Bangladesh': 6, 'Barbados': 6, 'Belarus': 6, 'Belgium': 6, 'Belize': 6, 'Benin': 6, 'Bermuda': 6, 'Bhutan': 6, 'Bolivia': 6, 'Bosnia and Herzegovina': 6, 'Botswana': 6, 'Brazil': 6, 'British Virgin Islands': 6, 'Brunei Darussalam': 3, 'Bulgaria': 3, 'Burkina Faso': 3, 'Burundi': 3, 'Cabo Verde': 3, 
'Cambodia': 3, 'Cameroon': 3, 'Canada': 3, 'Cayman Islands': 3, 'Central African Republic': 3, 'Chad': 3, 'Channel Islands': 3, 'Chile': 3, 'China': 3, 'Colombia': 3, 'Comoros': 3, '"Congo': 6, 'Costa Rica': 3, "Cote d'Ivoire": 3, 'Croatia': 3, 'Cuba': 3, 'Curacao': 3, 'Cyprus': 3, 'Czech Republic': 3, 'Denmark': 3, 'Djibouti': 3, 'Dominica': 3, 'Dominican Republic': 3, 'Ecuador': 3, '"Egypt': 3, 'El Salvador': 3, 'Equatorial Guinea': 3, 'Eritrea': 3, 'Estonia': 3, 'Ethiopia': 3, 'Faroe Islands': 3, 'Fiji': 3, 'Finland': 3, 'France': 3, 'French Polynesia': 3, 'Gabon': 3, '"Gambia': 3, 'Georgia': 3, 'Germany': 3, 'Ghana': 3, 'Gibraltar': 3, 'Greece': 3, 'Greenland': 3, 'Grenada': 3, 'Guam': 3, 'Guatemala': 3, 'Guinea': 3, 'Guinea-Bissau': 3, 'Guyana': 3, 'Haiti': 3, 'Honduras': 3, '"Hong Kong SAR': 3, 'Hungary': 3, 'Iceland': 3, 'India': 3, 'Indonesia': 3, '"Iran': 3, 'Iraq': 3, 'Ireland': 3, 'Isle of Man': 3, 'Israel': 3, 'Italy': 3, 'Jamaica': 3, 'Japan': 3, 'Jordan': 3, 'Kazakhstan': 3, 'Kenya': 3, 'Kiribati': 3, '"Korea': 6, 'Kosovo': 1, 'Kuwait': 3, 'Kyrgyz Republic': 3, 'Lao PDR': 3, 'Latvia': 3, 'Lebanon': 3, 'Lesotho': 3, 'Liberia': 3, 'Libya': 3, 'Liechtenstein': 3, 'Lithuania': 3, 'Luxembourg': 3, '"Macao SAR': 3, '"Macedonia': 3, 'Madagascar': 3, 'Malawi': 3, 'Malaysia': 3, 'Maldives': 3, 'Mali': 3, 'Malta': 3, 'Marshall Islands': 3, 'Mauritania': 3, 'Mauritius': 3, 'Mexico': 3, '"Micronesia': 3, 'Moldova': 3, 'Monaco': 3, 'Mongolia': 3, 'Montenegro': 3, 'Morocco': 3, 'Mozambique': 3, 'Myanmar': 3, 'Namibia': 3, 'Nauru': 3, 'Nepal': 3, 'Netherlands': 3, 'New Caledonia': 3, 'New Zealand': 3, 'Nicaragua': 3, 'Niger': 3, 'Nigeria': 3, 'Northern Mariana Islands': 3, 'Norway': 3, 'Oman': 3, 'Pakistan': 3, 'Palau': 3, 'Panama': 3, 'Papua New Guinea': 3, 'Paraguay': 3, 'Peru': 3, 'Philippines': 3, 'Poland': 3, 'Portugal': 3, 'Puerto Rico': 3, 'Qatar': 3, 'Romania': 3, 'Russian Federation': 3, 'Rwanda': 3, 'Samoa': 3, 'San Marino': 3, 'Sao Tome and Principe': 3, 
'Saudi Arabia': 3, 'Senegal': 3, 'Seychelles': 3, 'Sierra Leone': 3, 'Singapore': 3, 'Sint Maarten (Dutch part)': 2, 'Slovak Republic': 3, 'Slovenia': 3, 'Solomon Islands': 3, 'Somalia': 3, 'South Africa': 3, 'South Sudan': 3, 'Spain': 3, 'Sri Lanka': 3, 'St. Kitts and Nevis': 3, 'St. Lucia': 3, 'St. Martin (French part)': 1, 'St. Vincent and the Grenadines': 3, 'Sudan': 3, 'Suriname': 3, 'Swaziland': 3, 'Sweden': 3, 'Switzerland': 3, 'Syrian Arab Republic': 3, 'Tajikistan': 3, 'Tanzania': 3, 'Thailand': 3, 'Timor-Leste': 3, 'Togo': 3, 'Tonga': 3, 'Trinidad and Tobago': 3, 'Tunisia': 3, 'Turkey': 3, 'Turkmenistan': 3, 'Turks and Caicos Islands': 3, 'Tuvalu': 3, 'Uganda': 3, 'Ukraine': 3, 'United Arab Emirates': 3, 'United Kingdom': 3, 'United States': 3, 'Uruguay': 3, 'Uzbekistan': 3, 'Vanuatu': 3, '"Venezuela': 3, 'Vietnam': 3, 'Virgin Islands (U.S.)': 3, '"Yemen': 3, 'Zambia': 3, 'Zimbabwe': 3}
## Country Name,Country Code,Indicator Name,Indicator Code,year,value
##
## Arab World,ARB,"Population, total",SP.POP.TOTL,1960,92496099.0
##
## Arab World,ARB,Rural population (% of total population),SP.RUR.TOTL.ZS,1960,68.7081520885329
##
## {'Country Name': 1, 'Arab World': 168, 'Caribbean small states': 168, 'Central Europe and the Baltics': 168, 'Early-demographic dividend': 168, 'East Asia & Pacific': 168, 'East Asia & Pacific (excluding high income)': 168, 'East Asia & Pacific (IDA & IBRD countries)': 168, 'Euro area': 168, 'Europe & Central Asia': 168, 'Europe & Central Asia (excluding high income)': 168, 'Europe & Central Asia (IDA & IBRD countries)': 168, 'European Union': 168, 'Fragile and conflict affected situations': 168, 'Heavily indebted poor countries (HIPC)': 168, 'High income': 168, 'IBRD only': 168, 'IDA & IBRD total': 168, 'IDA blend': 168, 'IDA only': 168, 'IDA total': 168, 'Late-demographic dividend': 168, 'Latin America & Caribbean': 168, 'Latin America & Caribbean (excluding high income)': 168, 'Latin America & the Caribbean (IDA & IBRD countries)': 168, 'Least developed countries: UN classification': 168, 'Low & middle income': 168, 'Low income': 168, 'Lower middle income': 168, 'Middle East & North Africa': 168, 'Middle East & North Africa (excluding high income)': 168, 'Middle East & North Africa (IDA & IBRD countries)': 168, 'Middle income': 168, 'North America': 168, 'OECD members': 168, 'Other small states': 168, 'Pacific island small states': 168, 'Post-demographic dividend': 168, 'Pre-demographic dividend': 168, 'Small states': 168, 'South Asia': 168, 'South Asia (IDA & IBRD)': 168, 'Sub-Saharan Africa': 168, 'Sub-Saharan Africa (excluding high income)': 168, 'Sub-Saharan Africa (IDA & IBRD countries)': 168, 'Upper middle income': 168, 'World': 168, 'Afghanistan': 168, 'Albania': 168, 'Algeria': 168, 'American Samoa': 168, 'Andorra': 168, 'Angola': 168, 'Antigua and Barbuda': 168, 'Argentina': 168, 'Armenia': 168, 'Aruba': 168, 'Australia': 168, 'Austria': 168, 'Azerbaijan': 168, '"Bahamas': 168, 'Bahrain': 168, 'Bangladesh': 168, 'Barbados': 168, 'Belarus': 168, 'Belgium': 168, 'Belize': 168, 'Benin': 168, 'Bermuda': 168, 'Bhutan': 168, 'Bolivia': 168, 'Bosnia and 
Herzegovina': 168, 'Botswana': 168, 'Brazil': 168, 'British Virgin Islands': 168, 'Brunei Darussalam': 168, 'Bulgaria': 168, 'Burkina Faso': 168, 'Burundi': 168, 'Cabo Verde': 168, 'Cambodia': 168, 'Cameroon': 168, 'Canada': 168, 'Cayman Islands': 168, 'Central African Republic': 168, 'Chad': 168, 'Channel Islands': 168, 'Chile': 168, 'China': 168, 'Colombia': 168, 'Comoros': 168, '"Congo': 336, 'Costa Rica': 168, "Cote d'Ivoire": 168, 'Croatia': 168, 'Cuba': 168, 'Curacao': 168, 'Cyprus': 168, 'Czech Republic': 168, 'Denmark': 168, 'Djibouti': 168, 'Dominica': 168, 'Dominican Republic': 168, 'Ecuador': 168, '"Egypt': 168, 'El Salvador': 168, 'Equatorial Guinea': 168, 'Eritrea': 156, 'Estonia': 168, 'Ethiopia': 168, 'Faroe Islands': 168, 'Fiji': 168, 'Finland': 168, 'France': 168, 'French Polynesia': 168, 'Gabon': 168, '"Gambia': 168, 'Georgia': 168, 'Germany': 168, 'Ghana': 168, 'Gibraltar': 168, 'Greece': 168, 'Greenland': 168, 'Grenada': 168, 'Guam': 168, 'Guatemala': 168, 'Guinea': 168, 'Guinea-Bissau': 168, 'Guyana': 168, 'Haiti': 168, 'Honduras': 168, '"Hong Kong SAR': 168, 'Hungary': 168, 'Iceland': 168, 'India': 168, 'Indonesia': 168, '"Iran': 168, 'Iraq': 168, 'Ireland': 168, 'Isle of Man': 168, 'Israel': 168, 'Italy': 168, 'Jamaica': 168, 'Japan': 168, 'Jordan': 168, 'Kazakhstan': 168, 'Kenya': 168, 'Kiribati': 168, '"Korea': 336, 'Kosovo': 56, 'Kuwait': 165, 'Kyrgyz Republic': 168, 'Lao PDR': 168, 'Latvia': 168, 'Lebanon': 168, 'Lesotho': 168, 'Liberia': 168, 'Libya': 168, 'Liechtenstein': 168, 'Lithuania': 168, 'Luxembourg': 168, '"Macao SAR': 168, '"Macedonia': 168, 'Madagascar': 168, 'Malawi': 168, 'Malaysia': 168, 'Maldives': 168, 'Mali': 168, 'Malta': 168, 'Marshall Islands': 168, 'Mauritania': 168, 'Mauritius': 168, 'Mexico': 168, '"Micronesia': 168, 'Moldova': 168, 'Monaco': 168, 'Mongolia': 168, 'Montenegro': 168, 'Morocco': 168, 'Mozambique': 168, 'Myanmar': 168, 'Namibia': 168, 'Nauru': 168, 'Nepal': 168, 'Netherlands': 168, 'New Caledonia': 
168, 'New Zealand': 168, 'Nicaragua': 168, 'Niger': 168, 'Nigeria': 168, 'Northern Mariana Islands': 168, 'Norway': 168, 'Oman': 168, 'Pakistan': 168, 'Palau': 168, 'Panama': 168, 'Papua New Guinea': 168, 'Paraguay': 168, 'Peru': 168, 'Philippines': 168, 'Poland': 168, 'Portugal': 168, 'Puerto Rico': 168, 'Qatar': 168, 'Romania': 168, 'Russian Federation': 168, 'Rwanda': 168, 'Samoa': 168, 'San Marino': 168, 'Sao Tome and Principe': 168, 'Saudi Arabia': 168, 'Senegal': 168, 'Seychelles': 168, 'Sierra Leone': 168, 'Singapore': 168, 'Sint Maarten (Dutch part)': 130, 'Slovak Republic': 168, 'Slovenia': 168, 'Solomon Islands': 168, 'Somalia': 168, 'South Africa': 168, 'South Sudan': 168, 'Spain': 168, 'Sri Lanka': 168, 'St. Kitts and Nevis': 168, 'St. Lucia': 168, 'St. Martin (French part)': 56, 'St. Vincent and the Grenadines': 168, 'Sudan': 168, 'Suriname': 168, 'Swaziland': 168, 'Sweden': 168, 'Switzerland': 168, 'Syrian Arab Republic': 168, 'Tajikistan': 168, 'Tanzania': 168, 'Thailand': 168, 'Timor-Leste': 168, 'Togo': 168, 'Tonga': 168, 'Trinidad and Tobago': 168, 'Tunisia': 168, 'Turkey': 168, 'Turkmenistan': 168, 'Turks and Caicos Islands': 168, 'Tuvalu': 168, 'Uganda': 168, 'Ukraine': 168, 'United Arab Emirates': 168, 'United Kingdom': 168, 'United States': 168, 'Uruguay': 168, 'Uzbekistan': 168, 'Vanuatu': 168, '"Venezuela': 168, 'Vietnam': 168, 'Virgin Islands (U.S.)': 168, '"Yemen': 168, 'Zambia': 168, 'Zimbabwe': 168, 'Serbia': 78, 'West Bank and Gaza': 78}
## Country Name Country Code \
## 0 Arab World ARB
## 1 Arab World ARB
## 2 Arab World ARB
## 3 Caribbean small states CSS
## 4 Caribbean small states CSS
## 5 Caribbean small states CSS
## 6 Central Europe and the Baltics CEB
## 7 Central Europe and the Baltics CEB
## 8 Central Europe and the Baltics CEB
## 9 Early-demographic dividend EAR
##
## Indicator Name Indicator Code year \
## 0 Population, total SP.POP.TOTL 1960
## 1 Rural population (% of total population) SP.RUR.TOTL.ZS 1960
## 2 Urban population (% of total) SP.URB.TOTL.IN.ZS 1960
## 3 Population, total SP.POP.TOTL 1960
## 4 Rural population (% of total population) SP.RUR.TOTL.ZS 1960
## 5 Urban population (% of total) SP.URB.TOTL.IN.ZS 1960
## 6 Population, total SP.POP.TOTL 1960
## 7 Rural population (% of total population) SP.RUR.TOTL.ZS 1960
## 8 Urban population (% of total) SP.URB.TOTL.IN.ZS 1960
## 9 Population, total SP.POP.TOTL 1960
##
## value
## 0 9.249610e+07
## 1 6.870815e+01
## 2 3.129185e+01
## 3 4.192721e+06
## 4 6.840152e+01
## 5 3.159848e+01
## 6 9.140158e+07
## 7 5.549208e+01
## 8 4.450792e+01
## 9 9.800680e+08
## Country Name Country Code \
## 10 Early-demographic dividend EAR
## 11 Early-demographic dividend EAR
## 12 East Asia & Pacific EAS
## 13 East Asia & Pacific EAS
## 14 East Asia & Pacific EAS
## 15 East Asia & Pacific (excluding high income) EAP
## 16 East Asia & Pacific (excluding high income) EAP
## 17 East Asia & Pacific (excluding high income) EAP
## 18 East Asia & Pacific (IDA & IBRD countries) TEA
## 19 East Asia & Pacific (IDA & IBRD countries) TEA
##
## Indicator Name Indicator Code year \
## 10 Rural population (% of total population) SP.RUR.TOTL.ZS 1960
## 11 Urban population (% of total) SP.URB.TOTL.IN.ZS 1960
## 12 Population, total SP.POP.TOTL 1960
## 13 Rural population (% of total population) SP.RUR.TOTL.ZS 1960
## 14 Urban population (% of total) SP.URB.TOTL.IN.ZS 1960
## 15 Population, total SP.POP.TOTL 1960
## 16 Rural population (% of total population) SP.RUR.TOTL.ZS 1960
## 17 Urban population (% of total) SP.URB.TOTL.IN.ZS 1960
## 18 Population, total SP.POP.TOTL 1960
## 19 Rural population (% of total population) SP.RUR.TOTL.ZS 1960
##
## value
## 10 7.705007e+01
## 11 2.294993e+01
## 12 1.042480e+09
## 13 7.752853e+01
## 14 2.247147e+01
## 15 8.964930e+08
## 16 8.308232e+01
## 17 1.691768e+01
## 18 8.850532e+08
## 19 8.338348e+01
## CountryName CountryCode Year Total Population \
## 0 Afghanistan AFG 1960 8994793.0
## 1 Afghanistan AFG 1961 9164945.0
## 2 Afghanistan AFG 1962 9343772.0
## 3 Afghanistan AFG 1963 9531555.0
## 4 Afghanistan AFG 1964 9728645.0
##
## Urban population (% of total)
## 0 8.221
## 1 8.508
## 2 8.805
## 3 9.110
## 4 9.426
## [(91401583.0, 44.507921139002597), (92237118.0, 45.206665319194002), (93014890.0, 45.866564696018003), (93845749.0, 46.5340927663649), (94722599.0, 47.208742980352604), (95447065.0, 47.8803084429574), (96148635.0, 48.505097191759397), (97043587.0, 49.067767135854098), (97882394.0, 49.638696249807701), (98602140.0, 50.215657693321887), (99133296.0, 50.780409860456999), (99638983.0, 51.429566445052899), (100363597.0, 52.162105936757101), (101120519.0, 52.894471471541799), (101946256.0, 53.627174447338199), (102862489.0, 54.349653085382698), (103770134.0, 55.061127012228795), (104589313.0, 55.7886862473798), (105304312.0, 56.530668389657201), (105924838.0, 57.213134522150497), (106564905.0, 57.822931161135998), (107187982.0, 58.286690506739795), (107770794.0, 58.683563322897996), (108326895.0, 59.081567030459006), (108853181.0, 59.480212620603197), (109360296.0, 59.873735202774107), (109847148.0, 60.258086533789701), (110296680.0, 60.638613615994608), (110688533.0, 61.020488525916214), (110801380.0, 61.312199620203295), (110745760.0, 61.520994481657802), (110290445.0, 61.741539221625203), (110005636.0, 61.820287894431203), (110081461.0, 61.779410244518786), (110019570.0, 61.751130812191001), (109913216.0, 61.715962603505297), (109563097.0, 61.695812920403299), (109459093.0, 61.661656630381501), (109207205.0, 61.632890822478196), (109092730.0, 61.595078134840001), (108405522.0, 61.567439264127209), (107800399.0, 61.571020971323101), (107097577.0, 61.629953932533901), (106760768.0, 61.670694008505109), (106466116.0, 61.711606934595004), (106173766.0, 61.7605094345057), (105901322.0, 61.815522500550095), (105504531.0, 61.887634194914291), (105126686.0, 61.964992935380899), (104924372.0, 62.020159705764101), (104543801.0, 62.059416833265885), (104174038.0, 62.099517041078904), (103935318.0, 62.141847349338995), (103713726.0, 62.197640397588302), (103496179.0, 62.269282909458894), (103256779.0, 62.357002383678797)]
Plots 20 and 21 are not displayed as they are redundant with plot 22.
Urban Population by Year for Country Code CEB:
Urban Population by Year for Country Code ARB:
Chapter 1 - Introduction to Networks
Introduction to networks - examples like social networks, transportation networks, etc.:
Types of graphs:
Network visualization - irrational (“looks like a hairball”) and rational visualizations:
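To make the graph types concrete, here is a minimal pure-Python sketch (not NetworkX, and not from the course) of how an undirected and a directed graph differ when stored as adjacency sets. The helper name `build_adjacency` and the toy edge list are assumptions made for illustration only.

```python
# Hypothetical helper (not part of the course code): builds an adjacency
# mapping from an edge list for either an undirected or a directed graph.
def build_adjacency(edges, directed=False):
    adjacency = {}
    for u, v in edges:
        # Every edge u -> v is stored on the source node
        adjacency.setdefault(u, set()).add(v)
        # Make sure the target node exists even if it has no outgoing edges
        adjacency.setdefault(v, set())
        if not directed:
            # Undirected graphs store the edge in both directions
            adjacency[v].add(u)
    return adjacency

# Toy edge list (an assumption, not course data)
edges = [("a", "b"), ("b", "c")]
undirected = build_adjacency(edges)
directed = build_adjacency(edges, directed=True)

print(sorted(undirected["b"]))  # ['a', 'c']
print(sorted(directed["b"]))    # ['c']
```

In the undirected version node "b" sees both of its neighbors, while in the directed version it only sees the node it points to.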
Example code includes:
## NEED TO MOCK UP T_sub from the above
import networkx as nx
import datetime
T_sub = nx.DiGraph()
T_sub.add_nodes_from([1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
T_sub.add_edges_from([(1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8), (1, 9), (1, 10), (1, 11), (1, 12), (1, 13), (1, 14), (1, 15), (1, 16), (1, 17), (1, 18), (1, 19), (1, 20), (1, 21), (1, 22), (1, 23), (1, 24), (1, 25), (1, 26), (1, 27), (1, 28), (1, 29), (1, 30), (1, 31), (1, 32), (1, 33), (1, 34), (1, 35), (1, 36), (1, 37), (1, 38), (1, 39), (1, 40), (1, 41), (1, 42), (1, 43), (1, 44), (1, 45), (1, 46), (1, 47), (1, 48), (1, 49), (16, 48), (16, 18), (16, 35), (16, 36), (18, 16), (18, 24), (18, 35), (18, 36), (19, 35), (19, 36), (19, 5), (19, 8), (19, 11), (19, 13), (19, 15), (19, 48), (19, 17), (19, 20), (19, 21), (19, 24), (19, 37), (19, 30), (19, 31), (28, 1), (28, 5), (28, 7), (28, 8), (28, 11), (28, 14), (28, 15), (28, 17), (28, 20), (28, 21), (28, 24), (28, 25), (28, 27), (28, 29), (28, 30), (28, 31), (28, 35), (28, 36), (28, 37), (28, 44), (28, 48), (28, 49), (36, 24), (36, 35), (36, 5), (36, 37), (37, 24), (37, 35), (37, 36), (39, 1), (39, 35), (39, 36), (39, 38), (39, 33), (39, 40), (39, 41), (39, 45), (39, 24), (42, 1), (43, 48), (43, 35), (43, 36), (43, 37), (43, 24), (43, 29), (43, 47), (45, 1), (45, 39), (45, 41)])
node_meta = [{'occupation': 'scientist', 'category': 'I'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'P'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 
'category': 'I'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'P'}]
# Attach metadata to each node (NetworkX 1.x API, where nodes() returns a list)
for x in range(len(T_sub.nodes())):
    T_sub.node[T_sub.nodes()[x]]["occupation"] = node_meta[x]["occupation"]
    T_sub.node[T_sub.nodes()[x]]["category"] = node_meta[x]["category"]
edge_meta = [{'date': datetime.date(2012, 11, 17)}, {'date': datetime.date(2007, 6, 19)}, {'date': datetime.date(2014, 3, 18)}, {'date': datetime.date(2007, 3, 18)}, {'date': datetime.date(2011, 12, 19)}, {'date': datetime.date(2013, 12, 7)}, {'date': datetime.date(2009, 11, 9)}, {'date': datetime.date(2008, 10, 7)}, {'date': datetime.date(2008, 8, 14)}, {'date': datetime.date(2011, 3, 22)}, {'date': datetime.date(2014, 8, 3)}, {'date': datetime.date(2007, 5, 19)}, {'date': datetime.date(2009, 12, 13)}, {'date': datetime.date(2011, 4, 7)}, {'date': datetime.date(2013, 8, 2)}, {'date': datetime.date(2014, 11, 17)}, {'date': datetime.date(2013, 5, 20)}, {'date': datetime.date(2010, 12, 15)}, {'date': datetime.date(2010, 11, 27)}, {'date': datetime.date(2013, 9, 5)}, {'date': datetime.date(2013, 3, 1)}, {'date': datetime.date(2007, 7, 8)}, {'date': datetime.date(2010, 5, 23)}, {'date': datetime.date(2007, 9, 14)}, {'date': datetime.date(2013, 1, 24)}, {'date': datetime.date(2013, 6, 21)}, {'date': datetime.date(2010, 6, 28)}, {'date': datetime.date(2011, 12, 2)}, {'date': datetime.date(2010, 7, 24)}, {'date': datetime.date(2010, 7, 4)}, {'date': datetime.date(2013, 9, 28)}, {'date': datetime.date(2007, 3, 17)}, {'date': datetime.date(2013, 11, 7)}, {'date': datetime.date(2012, 8, 13)}, {'date': datetime.date(2009, 2, 19)}, {'date': datetime.date(2007, 3, 17)}, {'date': datetime.date(2011, 11, 15)}, {'date': datetime.date(2011, 12, 26)}, {'date': datetime.date(2010, 2, 14)}, {'date': datetime.date(2014, 4, 16)}, {'date': datetime.date(2010, 2, 28)}, {'date': datetime.date(2007, 11, 2)}, {'date': datetime.date(2008, 5, 17)}, {'date': datetime.date(2013, 11, 18)}, {'date': datetime.date(2010, 11, 14)}, {'date': datetime.date(2007, 8, 19)}, {'date': datetime.date(2012, 5, 11)}, {'date': datetime.date(2007, 10, 27)}, {'date': datetime.date(2009, 11, 14)}, {'date': datetime.date(2009, 4, 19)}, {'date': datetime.date(2007, 7, 14)}, {'date': datetime.date(2012, 5, 7)}, 
{'date': datetime.date(2014, 5, 4)}, {'date': datetime.date(2012, 6, 16)}, {'date': datetime.date(2012, 4, 25)}, {'date': datetime.date(2012, 6, 25)}, {'date': datetime.date(2010, 10, 14)}, {'date': datetime.date(2013, 4, 18)}, {'date': datetime.date(2013, 10, 6)}, {'date': datetime.date(2009, 8, 2)}, {'date': datetime.date(2008, 9, 23)}, {'date': datetime.date(2011, 11, 26)}, {'date': datetime.date(2010, 1, 22)}, {'date': datetime.date(2012, 6, 23)}, {'date': datetime.date(2013, 11, 20)}, {'date': datetime.date(2008, 7, 6)}, {'date': datetime.date(2009, 4, 12)}, {'date': datetime.date(2011, 12, 28)}, {'date': datetime.date(2012, 1, 22)}, {'date': datetime.date(2009, 1, 26)}, {'date': datetime.date(2012, 1, 13)}, {'date': datetime.date(2010, 9, 26)}, {'date': datetime.date(2013, 11, 14)}, {'date': datetime.date(2010, 7, 22)}, {'date': datetime.date(2013, 3, 17)}, {'date': datetime.date(2008, 10, 18)}, {'date': datetime.date(2008, 12, 9)}, {'date': datetime.date(2012, 1, 14)}, {'date': datetime.date(2012, 6, 28)}, {'date': datetime.date(2011, 10, 5)}, {'date': datetime.date(2007, 5, 19)}, {'date': datetime.date(2013, 1, 24)}, {'date': datetime.date(2008, 6, 28)}, {'date': datetime.date(2008, 5, 16)}, {'date': datetime.date(2013, 5, 8)}, {'date': datetime.date(2007, 7, 23)}, {'date': datetime.date(2010, 8, 4)}, {'date': datetime.date(2011, 10, 18)}, {'date': datetime.date(2011, 6, 2)}, {'date': datetime.date(2009, 5, 23)}, {'date': datetime.date(2010, 10, 14)}, {'date': datetime.date(2013, 7, 17)}, {'date': datetime.date(2008, 5, 19)}, {'date': datetime.date(2008, 3, 19)}, {'date': datetime.date(2010, 8, 14)}, {'date': datetime.date(2012, 6, 19)}, {'date': datetime.date(2013, 8, 12)}, {'date': datetime.date(2013, 7, 6)}, {'date': datetime.date(2014, 10, 11)}, {'date': datetime.date(2012, 7, 1)}, {'date': datetime.date(2013, 11, 5)}, {'date': datetime.date(2009, 11, 6)}, {'date': datetime.date(2009, 4, 19)}, {'date': datetime.date(2008, 8, 12)}, {'date': 
datetime.date(2012, 8, 8)}, {'date': datetime.date(2009, 8, 12)}, {'date': datetime.date(2012, 5, 27)}, {'date': datetime.date(2011, 9, 15)}, {'date': datetime.date(2013, 12, 19)}, {'date': datetime.date(2007, 12, 7)}, {'date': datetime.date(2008, 3, 4)}, {'date': datetime.date(2013, 9, 16)}, {'date': datetime.date(2009, 11, 22)}, {'date': datetime.date(2014, 9, 19)}, {'date': datetime.date(2008, 10, 20)}, {'date': datetime.date(2010, 12, 16)}, {'date': datetime.date(2013, 3, 15)}, {'date': datetime.date(2012, 4, 25)}, {'date': datetime.date(2009, 5, 10)}]
for x in range(len(T_sub.edges())):
    a, b = T_sub.edges()[x]
    T_sub.edge[a][b]["date"] = edge_meta[x]["date"]
# Import necessary modules
import matplotlib.pyplot as plt
# Draw the graph to screen
nx.draw(T_sub)
# plt.show()
plt.savefig("_dummyPy024.png", bbox_inches="tight")
# Also need to mock up T
# Use T_sub for these
# Use a list comprehension to get the nodes of interest: noi
noi = [n for n, d in T_sub.nodes(data=True) if d['occupation'] == 'scientist']
# Use a list comprehension to get the edges of interest: eoi
eoi = [(u, v) for u, v, d in T_sub.edges(data=True) if d["date"] < datetime.date(2010, 1, 1)]
# Set the weight of the edge
T_sub.edge[1][10]["weight"] = 2
# Iterate over all the edges (with metadata)
for u, v, d in T_sub.edges(data=True):
    # Check if node 293 is involved
    # Make it node 23 instead
    if 23 in [u, v]:
        # Set the weight to 1.1
        T_sub.edge[u][v]["weight"] = 1.1
# Define find_selfloop_nodes()
def find_selfloop_nodes(G):
    """
    Finds all nodes that have self-loops in the graph G.
    """
    nodes_in_selfloops = []
    # Iterate over all the edges of G
    for u, v in G.edges():
        # Check if node u and node v are the same
        if u == v:
            # Append node u to nodes_in_selfloops
            nodes_in_selfloops.append(u)
    return nodes_in_selfloops
# Check whether number of self loops equals the number of nodes in self loops
# The mock-up above has no self-loops, so this is just for reference on how to find them
assert T_sub.number_of_selfloops() == len(find_selfloop_nodes(T_sub))
# Import nxviz
import nxviz as nv
# Create the MatrixPlot object: m
m = nv.MatrixPlot(T_sub)
# Draw m to the screen
m.draw()
# Display the plot
# plt.show()
plt.savefig("_dummyPy025.png", bbox_inches="tight")
# Convert T to a matrix format: A
A = nx.to_numpy_matrix(T_sub)
# Convert A back to the NetworkX form as a directed graph: T_conv
T_conv = nx.from_numpy_matrix(A, create_using=nx.DiGraph())
# Check that the `category` metadata field is lost from each node
for n, d in T_conv.nodes(data=True):
    assert 'category' not in d.keys()
# Import necessary modules
import matplotlib.pyplot as plt
from nxviz import CircosPlot
# Create the CircosPlot object: c
c = CircosPlot(T_sub)
# Draw c to the screen
c.draw()
# Display the plot
# plt.show()
plt.savefig("_dummyPy026.png", bbox_inches="tight")
# Import necessary modules
from nxviz import ArcPlot
# Create the un-customized ArcPlot object: a
a = ArcPlot(T_sub)
# Draw a to the screen
a.draw()
# Display the plot
# plt.show()
plt.savefig("_dummyPy027.png", bbox_inches="tight")
# Create the customized ArcPlot object: a2
a2 = ArcPlot(T_sub, node_order="category", node_color="category")
# Draw a2 to the screen
a2.draw()
# Display the plot
# plt.show()
plt.savefig("_dummyPy028.png", bbox_inches="tight")
Example network plot:
Example MatrixPlot (network):
Example CircosPlot (network):
Example ArcPlot (network):
Example ArcPlot (network) colored by category:
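The matrix round trip in the example code above drops node metadata because an adjacency matrix only records which positions are connected. A minimal pure-Python sketch of that idea follows; the toy nodes, edges, and variable names are assumptions, and plain lists stand in for the NumPy matrix used in the course code.

```python
# Hypothetical toy graph: node metadata plus a directed edge list
node_meta = {1: {"category": "I"}, 2: {"category": "P"}, 3: {"category": "D"}}
edges = [(1, 2), (1, 3), (3, 2)]

# Map each node label to a row/column position
nodes = sorted(node_meta)
index = {n: i for i, n in enumerate(nodes)}

# Build the adjacency matrix: entry [i][j] is 1 if there is an edge i -> j
matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    matrix[index[u]][index[v]] = 1

# Rebuild an edge list from the matrix alone: only positions 0..n-1 survive,
# so the original labels and the 'category' field cannot be recovered
rebuilt_edges = [
    (i, j)
    for i, row in enumerate(matrix)
    for j, value in enumerate(row)
    if value
]

print(matrix)         # [[0, 1, 1], [0, 0, 0], [0, 1, 0]]
print(rebuilt_edges)  # [(0, 1), (0, 2), (2, 1)]
```

The rebuilt edges describe the same structure, but any metadata would have to be re-attached separately.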
Chapter 2 - Important Nodes
Degree centrality - one method of determining important nodes:
Graph algorithms - path finding for optimization (e.g., shortest path between nodes, information or disease spread, etc.):
Betweenness centrality - including the key concept of “all shortest paths”:
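As a rough illustration of degree centrality, the sketch below computes it by hand as the number of neighbors divided by the number of other nodes (n - 1). It is plain Python on an assumed toy "star" graph, not the NetworkX call used in the example code, and the function and variable names are made up for this note.

```python
def degree_centrality(adjacency):
    """Degree centrality for an undirected graph stored as {node: set_of_neighbors}."""
    n = len(adjacency)
    if n <= 1:
        # With zero or one node there are no other nodes to connect to
        return {node: 0.0 for node in adjacency}
    # Fraction of the other (n - 1) nodes that each node is connected to
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}

# Assumed toy graph: one hub connected to three leaves
star = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
}
centrality = degree_centrality(star)
print(centrality["hub"])  # 1.0
```

The hub touches every other node, so its centrality is 1.0, while each leaf touches one of three other nodes.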
Example code includes:
import networkx as nx
import matplotlib.pyplot as plt
import datetime
# DO NOT HAVE Graph T
# Make the same as above
T = nx.DiGraph()
T.add_nodes_from([1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
T.add_edges_from([(1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8), (1, 9), (1, 10), (1, 11), (1, 12), (1, 13), (1, 14), (1, 15), (1, 16), (1, 17), (1, 18), (1, 19), (1, 20), (1, 21), (1, 22), (1, 23), (1, 24), (1, 25), (1, 26), (1, 27), (1, 28), (1, 29), (1, 30), (1, 31), (1, 32), (1, 33), (1, 34), (1, 35), (1, 36), (1, 37), (1, 38), (1, 39), (1, 40), (1, 41), (1, 42), (1, 43), (1, 44), (1, 45), (1, 46), (1, 47), (1, 48), (1, 49), (16, 48), (16, 18), (16, 35), (16, 36), (18, 16), (18, 24), (18, 35), (18, 36), (19, 35), (19, 36), (19, 5), (19, 8), (19, 11), (19, 13), (19, 15), (19, 48), (19, 17), (19, 20), (19, 21), (19, 24), (19, 37), (19, 30), (19, 31), (28, 1), (28, 5), (28, 7), (28, 8), (28, 11), (28, 14), (28, 15), (28, 17), (28, 20), (28, 21), (28, 24), (28, 25), (28, 27), (28, 29), (28, 30), (28, 31), (28, 35), (28, 36), (28, 37), (28, 44), (28, 48), (28, 49), (36, 24), (36, 35), (36, 5), (36, 37), (37, 24), (37, 35), (37, 36), (39, 1), (39, 35), (39, 36), (39, 38), (39, 33), (39, 40), (39, 41), (39, 45), (39, 24), (42, 1), (43, 48), (43, 35), (43, 36), (43, 37), (43, 24), (43, 29), (43, 47), (45, 1), (45, 39), (45, 41)])
node_meta = [{'occupation': 'scientist', 'category': 'I'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'P'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 
'category': 'I'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'P'}]
for x in range(len(T.nodes())):
    T.node[T.nodes()[x]]["occupation"] = node_meta[x]["occupation"]
    T.node[T.nodes()[x]]["category"] = node_meta[x]["category"]
edge_meta = [{'date': datetime.date(2012, 11, 17)}, {'date': datetime.date(2007, 6, 19)}, {'date': datetime.date(2014, 3, 18)}, {'date': datetime.date(2007, 3, 18)}, {'date': datetime.date(2011, 12, 19)}, {'date': datetime.date(2013, 12, 7)}, {'date': datetime.date(2009, 11, 9)}, {'date': datetime.date(2008, 10, 7)}, {'date': datetime.date(2008, 8, 14)}, {'date': datetime.date(2011, 3, 22)}, {'date': datetime.date(2014, 8, 3)}, {'date': datetime.date(2007, 5, 19)}, {'date': datetime.date(2009, 12, 13)}, {'date': datetime.date(2011, 4, 7)}, {'date': datetime.date(2013, 8, 2)}, {'date': datetime.date(2014, 11, 17)}, {'date': datetime.date(2013, 5, 20)}, {'date': datetime.date(2010, 12, 15)}, {'date': datetime.date(2010, 11, 27)}, {'date': datetime.date(2013, 9, 5)}, {'date': datetime.date(2013, 3, 1)}, {'date': datetime.date(2007, 7, 8)}, {'date': datetime.date(2010, 5, 23)}, {'date': datetime.date(2007, 9, 14)}, {'date': datetime.date(2013, 1, 24)}, {'date': datetime.date(2013, 6, 21)}, {'date': datetime.date(2010, 6, 28)}, {'date': datetime.date(2011, 12, 2)}, {'date': datetime.date(2010, 7, 24)}, {'date': datetime.date(2010, 7, 4)}, {'date': datetime.date(2013, 9, 28)}, {'date': datetime.date(2007, 3, 17)}, {'date': datetime.date(2013, 11, 7)}, {'date': datetime.date(2012, 8, 13)}, {'date': datetime.date(2009, 2, 19)}, {'date': datetime.date(2007, 3, 17)}, {'date': datetime.date(2011, 11, 15)}, {'date': datetime.date(2011, 12, 26)}, {'date': datetime.date(2010, 2, 14)}, {'date': datetime.date(2014, 4, 16)}, {'date': datetime.date(2010, 2, 28)}, {'date': datetime.date(2007, 11, 2)}, {'date': datetime.date(2008, 5, 17)}, {'date': datetime.date(2013, 11, 18)}, {'date': datetime.date(2010, 11, 14)}, {'date': datetime.date(2007, 8, 19)}, {'date': datetime.date(2012, 5, 11)}, {'date': datetime.date(2007, 10, 27)}, {'date': datetime.date(2009, 11, 14)}, {'date': datetime.date(2009, 4, 19)}, {'date': datetime.date(2007, 7, 14)}, {'date': datetime.date(2012, 5, 7)}, 
{'date': datetime.date(2014, 5, 4)}, {'date': datetime.date(2012, 6, 16)}, {'date': datetime.date(2012, 4, 25)}, {'date': datetime.date(2012, 6, 25)}, {'date': datetime.date(2010, 10, 14)}, {'date': datetime.date(2013, 4, 18)}, {'date': datetime.date(2013, 10, 6)}, {'date': datetime.date(2009, 8, 2)}, {'date': datetime.date(2008, 9, 23)}, {'date': datetime.date(2011, 11, 26)}, {'date': datetime.date(2010, 1, 22)}, {'date': datetime.date(2012, 6, 23)}, {'date': datetime.date(2013, 11, 20)}, {'date': datetime.date(2008, 7, 6)}, {'date': datetime.date(2009, 4, 12)}, {'date': datetime.date(2011, 12, 28)}, {'date': datetime.date(2012, 1, 22)}, {'date': datetime.date(2009, 1, 26)}, {'date': datetime.date(2012, 1, 13)}, {'date': datetime.date(2010, 9, 26)}, {'date': datetime.date(2013, 11, 14)}, {'date': datetime.date(2010, 7, 22)}, {'date': datetime.date(2013, 3, 17)}, {'date': datetime.date(2008, 10, 18)}, {'date': datetime.date(2008, 12, 9)}, {'date': datetime.date(2012, 1, 14)}, {'date': datetime.date(2012, 6, 28)}, {'date': datetime.date(2011, 10, 5)}, {'date': datetime.date(2007, 5, 19)}, {'date': datetime.date(2013, 1, 24)}, {'date': datetime.date(2008, 6, 28)}, {'date': datetime.date(2008, 5, 16)}, {'date': datetime.date(2013, 5, 8)}, {'date': datetime.date(2007, 7, 23)}, {'date': datetime.date(2010, 8, 4)}, {'date': datetime.date(2011, 10, 18)}, {'date': datetime.date(2011, 6, 2)}, {'date': datetime.date(2009, 5, 23)}, {'date': datetime.date(2010, 10, 14)}, {'date': datetime.date(2013, 7, 17)}, {'date': datetime.date(2008, 5, 19)}, {'date': datetime.date(2008, 3, 19)}, {'date': datetime.date(2010, 8, 14)}, {'date': datetime.date(2012, 6, 19)}, {'date': datetime.date(2013, 8, 12)}, {'date': datetime.date(2013, 7, 6)}, {'date': datetime.date(2014, 10, 11)}, {'date': datetime.date(2012, 7, 1)}, {'date': datetime.date(2013, 11, 5)}, {'date': datetime.date(2009, 11, 6)}, {'date': datetime.date(2009, 4, 19)}, {'date': datetime.date(2008, 8, 12)}, {'date': 
datetime.date(2012, 8, 8)}, {'date': datetime.date(2009, 8, 12)}, {'date': datetime.date(2012, 5, 27)}, {'date': datetime.date(2011, 9, 15)}, {'date': datetime.date(2013, 12, 19)}, {'date': datetime.date(2007, 12, 7)}, {'date': datetime.date(2008, 3, 4)}, {'date': datetime.date(2013, 9, 16)}, {'date': datetime.date(2009, 11, 22)}, {'date': datetime.date(2014, 9, 19)}, {'date': datetime.date(2008, 10, 20)}, {'date': datetime.date(2010, 12, 16)}, {'date': datetime.date(2013, 3, 15)}, {'date': datetime.date(2012, 4, 25)}, {'date': datetime.date(2009, 5, 10)}]
for x in range(len(T.edges())):
    a, b = T.edges()[x]
    T.edge[a][b]["date"] = edge_meta[x]["date"]
# Define nodes_with_m_nbrs()
def nodes_with_m_nbrs(G, m):
    """
    Returns all nodes in graph G that have m neighbors.
    """
    nodes = set()
    # Iterate over all nodes in G
    for n in G.nodes():
        # Check if the number of neighbors of n matches m
        if len(G.neighbors(n)) == m:
            # Add the node n to the set
            nodes.add(n)
    # Return the nodes with m neighbors
    return nodes
# Compute and print all nodes in T that have 3 neighbors
three_nbrs = nodes_with_m_nbrs(T, 3)
print(three_nbrs)
# Compute the degree of every node: degrees
degrees = [len(T.neighbors(n)) for n in T.nodes()]
# Print the degrees
print(degrees)
# Compute the degree centrality of the Twitter network: deg_cent
deg_cent = nx.degree_centrality(T)
# Plot a histogram of the degree centrality distribution of the graph.
plt.figure()
plt.hist(list(deg_cent.values()))
# plt.show()
plt.savefig("_dummyPy029.png", bbox_inches="tight")
plt.clf()
# Plot a histogram of the degree distribution of the graph
plt.figure()
plt.hist(degrees)
# plt.show()
plt.savefig("_dummyPy030.png", bbox_inches="tight")
plt.clf()
# Plot a scatter plot of the centrality distribution and the degree distribution
plt.figure()
plt.scatter(degrees, list(deg_cent.values()))
# plt.show()
plt.savefig("_dummyPy031.png", bbox_inches="tight")
plt.clf()
def path_exists(G, node1, node2):
    """
    This function checks whether a path exists between two nodes (node1, node2) in graph G.
    """
    visited_nodes = set()
    queue = [node1]
    for node in queue:
        # Materialize the neighbors so they can be scanned twice below
        neighbors = list(G.neighbors(node))
        if node2 in neighbors:
            print('Path exists between nodes {0} and {1}'.format(node1, node2))
            return True
        else:
            visited_nodes.add(node)
            queue.extend([n for n in neighbors if n not in visited_nodes])
        # Check to see if the final element of the queue has been reached
        if node == queue[-1]:
            print('Path does not exist between nodes {0} and {1}'.format(node1, node2))
            # Place the appropriate return statement
            return False
# Compute the betweenness centrality of T: bet_cen
bet_cen = nx.betweenness_centrality(T)
# Compute the degree centrality of T: deg_cen
deg_cen = nx.degree_centrality(T)
# Create a scatter plot of betweenness centrality and degree centrality
plt.scatter(list(bet_cen.values()), list(deg_cen.values()))
# Display the plot
# plt.show()
plt.savefig("_dummyPy032.png", bbox_inches="tight")
plt.clf()
# Define find_nodes_with_highest_deg_cent()
def find_nodes_with_highest_deg_cent(G):
    # Compute the degree centrality of G: deg_cent
    deg_cent = nx.degree_centrality(G)
    # Compute the maximum degree centrality: max_dc
    max_dc = max(list(deg_cent.values()))
    nodes = set()
    # Iterate over the degree centrality dictionary
    for k, v in deg_cent.items():
        # Check if the current value has the maximum degree centrality
        if v == max_dc:
            # Add the current node to the set of nodes
            nodes.add(k)
    return nodes
# Find the node(s) that has the highest degree centrality in T: top_dc
top_dc = find_nodes_with_highest_deg_cent(T)
print(top_dc)
# Write the assertion statement
for node in top_dc:
    assert nx.degree_centrality(T)[node] == max(nx.degree_centrality(T).values())
# Define find_node_with_highest_bet_cent()
def find_node_with_highest_bet_cent(G):
    # Compute betweenness centrality: bet_cent
    bet_cent = nx.betweenness_centrality(G)
    # Compute maximum betweenness centrality: max_bc
    max_bc = max(list(bet_cent.values()))
    nodes = set()
    # Iterate over the betweenness centrality dictionary
    for k, v in bet_cent.items():
        # Check if the current value has the maximum betweenness centrality
        if v == max_bc:
            # Add the current node to the set of nodes
            nodes.add(k)
    return nodes
# Use that function to find the node(s) that has the highest betweenness centrality in the network: top_bc
top_bc = find_node_with_highest_bet_cent(T)
print(top_bc)
# Write an assertion statement that checks that the node(s) is/are correctly identified.
for node in top_bc:
    assert nx.betweenness_centrality(T)[node] == max(nx.betweenness_centrality(T).values())
## {45, 37}
## [47, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 4, 15, 0, 0, 0, 0, 0, 0, 0, 0, 22, 0, 0, 0, 0, 0, 0, 0, 4, 3, 0, 9, 0, 0, 1, 7, 0, 3, 0, 0, 0, 0]
## {1}
## {1}
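The path-finding idea behind the path_exists() example can also be sketched as a breadth-first search that returns one shortest path rather than only True/False. This is a plain-Python sketch on an assumed toy directed graph (adjacency lists in a dict); the function name `shortest_path` and the data are illustrative, not course code.

```python
from collections import deque

def shortest_path(adjacency, source, target):
    """Return one shortest path from source to target as a list, or None."""
    if source == target:
        return [source]
    # parents doubles as the visited set and lets us rebuild the path
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adjacency.get(node, ()):
            if nbr in parents:
                continue
            parents[nbr] = node
            if nbr == target:
                # Walk back through the parents to recover the path
                path = [nbr]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(nbr)
    return None

# Assumed toy directed graph: node 6 points at 1 but nothing points at 6
graph = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: [], 6: [1]}
print(shortest_path(graph, 1, 5))  # [1, 2, 4, 5]
print(shortest_path(graph, 1, 6))  # None
```

Using a deque keeps the queue operations cheap, and tracking parents avoids revisiting nodes.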
Histogram of degree centrality:
Histogram of degree distribution:
Scatter plot of degree centrality vs degree distribution:
Scatter plot of degree centrality vs betweenness centrality:
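The degree-centrality helper above can be checked by hand on a graph small enough to reason about. This is a minimal sketch that does not use networkx: the adjacency dictionary and the function names `degree_centrality` and `nodes_with_highest` are hypothetical, and the score is computed as degree divided by (n - 1), which is the usual definition of degree centrality.

```python
# Hypothetical toy graph as an adjacency dict (node -> set of neighbours).
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}


def degree_centrality(adj):
    """Degree of each node divided by the number of other nodes (n - 1)."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


def nodes_with_highest(scores):
    """Return the set of keys whose score equals the maximum score."""
    top = max(scores.values())
    return {node for node, score in scores.items() if score == top}


deg_cent = degree_centrality(adjacency)
top_dc = nodes_with_highest(deg_cent)
print(top_dc)
```

Returning a set rather than a single node keeps ties, which matches the `{45, 37}` style of output shown above.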
Chapter 3 - Structures
Cliques and communities - idea of tightly-knit groups:
Maximal cliques - a maximal clique is a clique that cannot be extended by adding any other node of the graph and still remain a clique:
Sub-graphs - sometimes helpful to view just a small portion of a larger graph:
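The clique and maximal-clique definitions can be made concrete with a short pure-Python sketch. The toy adjacency dictionary and the helpers `is_clique` and `is_maximal_clique` are hypothetical names used only for illustration; they are not part of networkx, and the brute-force check is only sensible for tiny graphs.

```python
from itertools import combinations

# Hypothetical toy graph: 1-2-3 form a triangle, 4 hangs off node 1.
adjacency = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2},
    4: {1},
}


def is_clique(adj, nodes):
    """True if every pair of the given nodes is joined by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))


def is_maximal_clique(adj, nodes):
    """True if nodes form a clique that no other single node can extend."""
    nodes = set(nodes)
    if not is_clique(adj, nodes):
        return False
    others = set(adj) - nodes
    return not any(is_clique(adj, nodes | {extra}) for extra in others)


print(is_clique(adjacency, {1, 2}))
print(is_maximal_clique(adjacency, {1, 2}))
print(is_maximal_clique(adjacency, {1, 2, 3}))
print(is_maximal_clique(adjacency, {1, 4}))
```

The pair {1, 2} is a clique but not maximal, because node 3 can be added; {1, 2, 3} and {1, 4} cannot be extended.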
Example code includes:
from itertools import combinations
# Define is_in_triangle()
def is_in_triangle(G, n):
    """
    Checks whether a node `n` in graph `G` is in a triangle relationship or not.
    Returns a boolean.
    """
    in_triangle = False
    # Iterate over all possible triangle relationship combinations
    for n1, n2 in combinations(G.neighbors(n), 2):
        # Check if an edge exists between n1 and n2
        if G.has_edge(n1, n2):
            in_triangle = True
            break
    return in_triangle
# DO NOT HAVE T (make randomly, minus metadata)
import networkx as nx
import random
import numpy as np
import matplotlib.pyplot as plt
T = nx.Graph()
T.add_nodes_from([x for x in range(1, 31)])
np.random.seed(170530)
n1 = np.random.choice(range(1, 31), size=100, replace=True)
n2 = np.random.choice(range(1, 31), size=100, replace=True)
# Require that first be less than second
edge_list = [(min(x, y), max(x, y)) for x, y in zip(n1, n2) if x != y]
T.add_edges_from(edge_list)
# A set() keeps only unique items (it is unordered, even though small integers often print in sorted order); if a = set([1, 2]) and a.add(1) is run, then a will still be {1, 2}
# Can remove items from the set using a.remove() and can add items to the set using a.add()
# Write a function that identifies all nodes in a triangle relationship with a given node.
def nodes_in_triangle(G, n):
    """
    Returns the nodes in a graph `G` that are involved in a triangle relationship with the node `n`.
    """
    triangle_nodes = set([n])
    # Iterate over all possible triangle relationship combinations
    for n1, n2 in combinations(G.neighbors(n), 2):
        # Check if n1 and n2 have an edge between them
        if G.has_edge(n1, n2):
            # Add n1 to triangle_nodes
            triangle_nodes.add(n1)
            # Add n2 to triangle_nodes
            triangle_nodes.add(n2)
    return triangle_nodes
# Write the assertion statement
assert len(nodes_in_triangle(T, 1)) == 5 # happens to be what the RNG generated in this case
# Define node_in_open_triangle()
def node_in_open_triangle(G, n):
    """
    Checks whether pairs of neighbors of node `n` in graph `G` are in an 'open triangle' relationship with node `n`.
    """
    in_open_triangle = False
    # Iterate over all possible triangle relationship combinations
    for n1, n2 in combinations(G.neighbors(n), 2):
        # Check if n1 and n2 do NOT have an edge between them
        if not G.has_edge(n1, n2):
            in_open_triangle = True
            break
    return in_open_triangle
# Compute the number of open triangles in T
num_open_triangles = 0
# Iterate over all the nodes in T
for n in T.nodes():
    # Check if the current node is in an open triangle
    if node_in_open_triangle(T, n):
        # Increment num_open_triangles
        num_open_triangles += 1
print(num_open_triangles)
# Define maximal_cliques()
def maximal_cliques(G, size):
    """
    Finds all maximal cliques in graph `G` that are of size `size`.
    """
    mcs = []
    for clique in nx.find_cliques(G):
        if len(clique) == size:
            mcs.append(clique)
    return mcs
# Check the number of maximal cliques of size 3 in the graph T (the course data has 33; this randomly generated T has 26)
assert len(maximal_cliques(T, 3)) == 26 # happens to be what the RNG returns in this case
# Define get_nodes_and_nbrs()
def get_nodes_and_nbrs(G, nodes_of_interest):
    """
    Returns a subgraph of the graph `G` with only the `nodes_of_interest` and their neighbors.
    """
    nodes_to_draw = []
    # Iterate over the nodes of interest
    for n in nodes_of_interest:
        # Append the nodes of interest to nodes_to_draw
        nodes_to_draw.append(n)
        # Iterate over all the neighbors of node n
        for nbr in G.neighbors(n):
            # Append the neighbors of n to nodes_to_draw
            nodes_to_draw.append(nbr)
    return G.subgraph(nodes_to_draw)
# Extract the subgraph with the nodes of interest: T_draw
nodes_of_interest = [8, 24, 26]
T_draw = get_nodes_and_nbrs(T, nodes_of_interest)
# Draw the subgraph to the screen
nx.draw(T_draw, with_labels=True)
# plt.show()
plt.savefig("_dummyPy033.png", bbox_inches="tight")
# Extract the nodes of interest: nodes
node_meta = [{'occupation': 'scientist', 'category': 'I'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'scientist', 'category': 'P'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'politician', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'scientist', 'category': 'D'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'celebrity', 'category': 'D'}, {'occupation': 'politician', 'category': 'D'}, {'occupation': 'politician', 'category': 'P'}, {'occupation': 'celebrity', 'category': 'I'}, {'occupation': 'celebrity', 'category': 'P'}, {'occupation': 'scientist', 'category': 'I'}, {'occupation': 'scientist', 'category': 'P'}]
for x in range(len(T.nodes())):
    T.node[T.nodes()[x]]["occupation"] = node_meta[x]["occupation"]
    T.node[T.nodes()[x]]["category"] = node_meta[x]["category"]
nodes = [n for n, d in T.nodes(data=True) if d['occupation'] == 'celebrity']
# Create the set of nodes: nodeset
nodeset = set(nodes)
# Iterate over nodes
for n in nodeset:
    # Compute the neighbors of n: nbrs
    nbrs = T.neighbors(n)
    # Compute the union of nodeset and nbrs: nodeset
    nodeset = nodeset.union(nbrs)
# Compute the subgraph using nodeset: T_sub
T_sub = T.subgraph(nodeset)
# Draw T_sub to the screen
nx.draw(T_sub, with_labels=True)
# plt.show()
plt.savefig("_dummyPy034.png", bbox_inches="tight")
## 30
Example Sub-graph (anything touching any of [8, 24, 26]):
Example Sub-graph (specified “occupation” in metadata):
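The "nodes of interest plus their neighbours" extraction used for both sub-graphs above can be sketched without networkx. The toy adjacency dictionary and the helper names `nodes_and_neighbours` and `induced_subgraph` are hypothetical; the sketch assumes an undirected graph stored as node -> set of neighbours.

```python
# Hypothetical toy graph: a path 2 - 1 - 3 - 4 - 5.
adjacency = {
    1: {2, 3},
    2: {1},
    3: {1, 4},
    4: {3, 5},
    5: {4},
}


def nodes_and_neighbours(adj, nodes_of_interest):
    """Return the nodes of interest together with all their direct neighbours."""
    keep = set(nodes_of_interest)
    for n in nodes_of_interest:
        keep |= adj[n]
    return keep


def induced_subgraph(adj, keep):
    """Keep only the chosen nodes and the edges that run between them."""
    return {n: adj[n] & keep for n in keep}


keep = nodes_and_neighbours(adjacency, [1])
sub = induced_subgraph(adjacency, keep)
print(sub)
```

The loop runs over `nodes_of_interest` rather than over `keep`, so the set being grown is never the one being iterated.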
Chapter 4 - Case Study
Case study introduction - GitHub collaborator data:
Case Study Part II - Visualization using the nxviz API:
Case Study Part III: Cliques:
Case Study Part IV: Additional Tasks (building a recommender):
Example code includes:
# Import necessary modules
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import random
# DO NOT HAVE Github collaborator graph "G"
# Dummy up the data - 20 each of 2 "flavors"
G = nx.Graph()
G.add_nodes_from([x for x in range(1, 41)])
np.random.seed(170531)
# Add edges for 1-20 with preference that they match to themselves
n1 = np.random.choice(range(1, 21), size=100, replace=True)
n2 = np.random.choice(range(1, 21), size=90, replace=True)
n3 = np.random.choice(range(21, 41), size=10, replace=True)
# Require that first be less than second
edge_list = [(min(x, y), max(x, y)) for x, y in zip(n1, np.append(n2, n3)) if x != y]
G.add_edges_from(edge_list)
# Add edges for 21-40 with preference that they match to themselves
n1 = np.random.choice(range(21, 41), size=50, replace=True)
n2 = np.random.choice(range(21, 41), size=40, replace=True)
n3 = np.random.choice(range(1, 21), size=10, replace=True)
# Require that first be less than second
edge_list = [(min(x, y), max(x, y)) for x, y in zip(n1, np.append(n2, n3)) if x != y]
G.add_edges_from(edge_list)
# Create two groupings for the nodes
node_meta = [{'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type01'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}, {'grouping': 'type02'}]
for x in range(len(G.nodes())):
    G.node[G.nodes()[x]]["grouping"] = node_meta[x]["grouping"]
# Plot the degree centrality distribution of the GitHub collaboration network
plt.hist(list(nx.degree_centrality(G).values()))
# plt.show()
plt.savefig("_dummyPy035.png", bbox_inches="tight")
plt.clf()
# Plot the betweenness centrality distribution of the GitHub collaboration network
plt.hist(list(nx.betweenness_centrality(G).values()))
# plt.show()
plt.savefig("_dummyPy036.png", bbox_inches="tight")
plt.clf()
# Import necessary modules
from nxviz import MatrixPlot
# Calculate the largest connected component subgraph: largest_ccs
largest_ccs = sorted(nx.connected_component_subgraphs(G), key=lambda x: len(x))[-1]
# Create the customized MatrixPlot object: h
h = MatrixPlot(largest_ccs, node_grouping="grouping")
# Draw the MatrixPlot to the screen
h.draw()
# plt.show()
plt.savefig("_dummyPy037.png", bbox_inches="tight")
# Import necessary modules
from nxviz.plots import ArcPlot
# Iterate over all the nodes in G, including the metadata
for n, d in G.nodes(data=True):
    # Calculate the degree of each node: G.node[n]['degree']
    G.node[n]['degree'] = nx.degree(G, n)
# Create the ArcPlot object: a
a = ArcPlot(G, node_order="degree")
# Draw the ArcPlot to the screen
a.draw()
# plt.show()
plt.savefig("_dummyPy038.png", bbox_inches="tight")
# Import necessary modules
from nxviz import CircosPlot
# Iterate over all the nodes, including the metadata
for n, d in G.nodes(data=True):
    # Calculate the degree of each node: G.node[n]['degree']
    G.node[n]['degree'] = nx.degree(G, n)
# Create the CircosPlot object: c
c = CircosPlot(G, node_order="degree", node_grouping="grouping", node_color="grouping")
# Draw the CircosPlot object to the screen
c.draw()
# plt.show()
plt.savefig("_dummyPy039.png", bbox_inches="tight")
# Calculate the maximal cliques in G: cliques
cliques = nx.find_cliques(G)
# Count and print the number of maximal cliques in G
print(len(list(cliques)))
# Find the author(s) that are part of the largest maximal clique: largest_clique
largest_clique = sorted(nx.find_cliques(G), key=lambda x:len(x))[-1]
# Create the subgraph of the largest_clique: G_lc
G_lc = G.subgraph(largest_clique)
# Create the CircosPlot object: c
c = CircosPlot(G_lc)
# Draw the CircosPlot to the screen
c.draw()
# plt.show()
plt.savefig("_dummyPy040.png", bbox_inches="tight")
# Compute the degree centralities of G: deg_cent
deg_cent = nx.degree_centrality(G)
# Compute the maximum degree centrality: max_dc
max_dc = max(deg_cent.values())
# Find the user(s) that have collaborated the most: prolific_collaborators
prolific_collaborators = [n for n, dc in deg_cent.items() if dc == max_dc]
# Print the most prolific collaborator(s)
print(prolific_collaborators)
# Identify the largest maximal clique: largest_max_clique
largest_max_clique = set(sorted(nx.find_cliques(G), key=lambda x: len(x))[-1])
# Create a subgraph from the largest_max_clique: G_lmc
G_lmc = G.subgraph(largest_max_clique)
# Go out 1 degree of separation
for node in G_lmc.nodes():
    G_lmc.add_nodes_from(G.neighbors(node))
    G_lmc.add_edges_from(zip([node]*len(G.neighbors(node)), G.neighbors(node)))
# Record each node's degree centrality score
for n in G_lmc.nodes():
    G_lmc.node[n]['degree centrality'] = nx.degree_centrality(G_lmc)[n]
# Create the ArcPlot object: a
a = ArcPlot(G_lmc, node_order = "degree centrality")
# Draw the ArcPlot to the screen
a.draw()
# plt.show()
plt.savefig("_dummyPy041.png", bbox_inches="tight")
# Import necessary modules
from itertools import combinations
from collections import defaultdict
# Initialize the defaultdict: recommended
recommended = defaultdict(int)
# Iterate over all the nodes in G
for n, d in G.nodes(data=True):
    # Iterate over all possible triangle relationship combinations
    for n1, n2 in combinations(G.neighbors(n), 2):
        # Check whether n1 and n2 do not have an edge
        if not G.has_edge(n1, n2):
            # Increment recommended
            recommended[(n1, n2)] += 1
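# Identify the top pairs of users (counts strictly above the 10th-highest count, so ties at that count are dropped and fewer than 10 pairs may be returned)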
all_counts = sorted(recommended.values())
top10_pairs = [pair for pair, count in recommended.items() if count > all_counts[-10]]
print(top10_pairs)
## 75
## [6]
## [(3, 5), (6, 8), (18, 1), (6, 2)]
Case study - degree distribution:
Case study - betweenness centrality:
Case study - MatrixPlot:
Case study - ArcPlot:
Case study - CircosPlot:
Case Study - CircosPlot (for largest clique):
Case Study - ArcPlot (ordered by degree centrality):
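The recommender idea above (count how often two users share a collaborator without being connected themselves) can be sketched in pure Python. The user names, the adjacency dictionary and the function `open_triangle_counts` are hypothetical. Unlike the code above, this sketch sorts each neighbour list so that a pair is always stored in one order, and it uses `Counter.most_common` to pick the top pairs; both are design choices of the sketch, not the course's method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical collaboration graph (user -> set of collaborators).
adjacency = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat", "fay"},
    "cat": {"ann", "bob"},
    "dan": {"ann", "eve", "fay"},
    "eve": {"dan"},
    "fay": {"bob", "dan"},
}


def open_triangle_counts(adj):
    """Count, for each unconnected pair, how many neighbours they share."""
    counts = Counter()
    for node, nbrs in adj.items():
        # Sorting gives every pair a single canonical order (n1 < n2).
        for n1, n2 in combinations(sorted(nbrs), 2):
            if n2 not in adj[n1]:
                counts[(n1, n2)] += 1
    return counts


counts = open_triangle_counts(adjacency)
top_pairs = counts.most_common(2)
print(top_pairs)
```

With a canonical order, a pair such as (1, 18) cannot also appear as (18, 1) and split its count across two keys.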
Chapter 1 - Introduction and flat files
Welcome to the course - importing from 1) flat files, 2) other native data, and 3) relational databases:
The importance of flat files in data science:
Importing flat files using numpy (only for data that is purely numerical):
Importing flat files using pandas - create 2-D data structures with columns of different data types:
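A purely numerical loader cannot convert a text header row, which is why the numpy examples below skip the first row. The same idea is sketched here with only the standard library `csv` module as a stand-in (it is not what the course uses). The column names match the sea slug header shown in the output below, but the two data rows are made-up values for illustration.

```python
import csv
import io

# Hypothetical tab-delimited content: one text header row, then numeric rows.
raw = "Time\tPercent\n99\t0.067\n99\t0.133\n"

reader = csv.reader(io.StringIO(raw), delimiter="\t")
# Pull the header off first so that only numeric rows remain.
header = next(reader)
rows = [[float(value) for value in row] for row in reader]

print(header)
print(rows)
```

Calling `float("Time")` would raise a `ValueError`, so the header has to be consumed before the conversion.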
Example code includes:
# put in directory ./PythonInputFiles/
# moby_dick.txt (converted to romeo-full.txt)
# digits.csv (using mnist_test.csv)
# digits_header.txt (skipped)
# seaslug.txt (downloaded)
# titanic.csv (converted from R)
# titanic_corrupt.txt (skipped)
myPath = "./PythonInputFiles/"
# NEED FILE "moby_dick.txt" (used "romeo-full.txt" instead)
# Open a file: file
file = open(myPath + "romeo-full.txt", mode="r")
# Print it
print(file.read())
# Check whether file is closed
print(file.closed)
# Close file
file.close()
# Check whether file is closed
print(file.closed)
# Read & print the first 3 lines
with open(myPath + "romeo-full.txt") as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
# NEED DIGIT RECOGNITION SITE - see http://yann.lecun.com/exdb/mnist/
# Import package
import numpy as np
# Assign filename to variable: file
file = myPath + 'mnist_test.csv'
# Load file as array: digits
digits = np.loadtxt(file, delimiter=",")
# Print datatype of digits
print(type(digits))
# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))
import matplotlib.pyplot as plt # so the plotting below can be done
# Plot reshaped data (matplotlib.pyplot already loaded as plt)
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
# plt.show()
plt.savefig("_dummyPy042.png", bbox_inches="tight")
plt.clf()
# File should be tab-delimited and with a header row (for the skiprows=1)
# Assign the filename: file
# file = 'digits_header.txt'
# Load the data: data
# data = np.loadtxt(file, delimiter="\t", skiprows=1, usecols=[0, 2])
# Print data
# print(data)
# NEED FILE FROM http://www.stat.ucla.edu/projects/datasets/seaslug-explanation.html
# Should be floats with a single text header row, and tab-delimited
# Assign filename: file
file = myPath + 'seaslug.txt'
# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)
# Print the first element of data
print(data[0])
# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter="\t", dtype=float, skiprows=1)
# Print the 10th element of data_float
print(data_float[9])
# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
# plt.show()
plt.savefig("_dummyPy043.png", bbox_inches="tight")
plt.clf()
# NEED FILE "titanic.csv"
# Idea is that np.genfromtxt() and np.recfromcsv() can accept mixed data types by returning a structured array (one record per row); dtype=None lets NumPy infer the data type of each column
# Assign the filename: file
# file = myPath + 'titanic.csv'
# Import file using np.recfromcsv: d
# d=np.recfromcsv(file) # This is like np.genfromtxt() with defaults set to dtype=None, delimiter=",", names=True
# Print out first three entries of d
# print(d[:3])
# PassengerId-Survived-Pclass-Sex-Age-SibSp-Parch-Ticket-Fare-Cabin-Embarked
# Import pandas as pd
import pandas as pd
# Assign the filename: file
file = myPath + 'titanic.csv'
# Read the file into a DataFrame: df
df = pd.read_csv(file)
# View the head of the DataFrame
print(df.head())
# Assign the filename: file
file = myPath + 'mnist_test.csv'
# Read the first 5 rows of the file into a DataFrame: data
data=pd.read_csv(file, nrows=5, header=None)
# Build a numpy array from the DataFrame: data_array
data_array = data.values
# Print the datatype of data_array to the shell
print(type(data_array))
# Assign filename: file
# file = 'titanic_corrupt.txt'
# Import file: data
# data = pd.read_csv(file, sep="\t", comment="#", na_values=["Nothing"])
# Print the head of the DataFrame
# print(data.head())
# Plot 'Age' variable in a histogram
# pd.DataFrame.hist(data[['Age']])
# plt.xlabel('Age (years)')
# plt.ylabel('count')
# plt.show()
## Romeo and Juliet
## Act 2, Scene 2
##
## SCENE II. Capulet's orchard.
##
## Enter ROMEO
##
## ROMEO
##
## He jests at scars that never felt a wound.
## JULIET appears above at a window
##
## But, soft! what light through yonder window breaks?
## It is the east, and Juliet is the sun.
## Arise, fair sun, and kill the envious moon,
## Who is already sick and pale with grief,
## That thou her maid art far more fair than she:
## Be not her maid, since she is envious;
## Her vestal livery is but sick and green
## And none but fools do wear it; cast it off.
## It is my lady, O, it is my love!
## O, that she knew she were!
## She speaks yet she says nothing: what of that?
## Her eye discourses; I will answer it.
## I am too bold, 'tis not to me she speaks:
## Two of the fairest stars in all the heaven,
## Having some business, do entreat her eyes
## To twinkle in their spheres till they return.
## What if her eyes were there, they in her head?
## The brightness of her cheek would shame those stars,
## As daylight doth a lamp; her eyes in heaven
## Would through the airy region stream so bright
## That birds would sing and think it were not night.
## See, how she leans her cheek upon her hand!
## O, that I were a glove upon that hand,
## That I might touch that cheek!
##
## JULIET
##
## Ay me!
##
## ROMEO
##
## She speaks:
## O, speak again, bright angel! for thou art
## As glorious to this night, being o'er my head
## As is a winged messenger of heaven
## Unto the white-upturned wondering eyes
## Of mortals that fall back to gaze on him
## When he bestrides the lazy-pacing clouds
## And sails upon the bosom of the air.
##
## JULIET
##
## O Romeo, Romeo! wherefore art thou Romeo?
## Deny thy father and refuse thy name;
## Or, if thou wilt not, be but sworn my love,
## And I'll no longer be a Capulet.
##
## ROMEO
##
## [Aside] Shall I hear more, or shall I speak at this?
##
## JULIET
##
## 'Tis but thy name that is my enemy;
## Thou art thyself, though not a Montague.
## What's Montague? it is nor hand, nor foot,
## Nor arm, nor face, nor any other part
## Belonging to a man. O, be some other name!
## What's in a name? that which we call a rose
## By any other name would smell as sweet;
## So Romeo would, were he not Romeo call'd,
## Retain that dear perfection which he owes
## Without that title. Romeo, doff thy name,
## And for that name which is no part of thee
## Take all myself.
##
## ROMEO
##
## I take thee at thy word:
## Call me but love, and I'll be new baptized;
## Henceforth I never will be Romeo.
##
## JULIET
##
## What man art thou that thus bescreen'd in night
## So stumblest on my counsel?
##
## ROMEO
##
## By a name
## I know not how to tell thee who I am:
## My name, dear saint, is hateful to myself,
## Because it is an enemy to thee;
## Had I it written, I would tear the word.
##
## JULIET
##
## My ears have not yet drunk a hundred words
## Of that tongue's utterance, yet I know the sound:
## Art thou not Romeo and a Montague?
##
## ROMEO
##
## Neither, fair saint, if either thee dislike.
##
## JULIET
##
## How camest thou hither, tell me, and wherefore?
## The orchard walls are high and hard to climb,
## And the place death, considering who thou art,
## If any of my kinsmen find thee here.
##
## ROMEO
##
## With love's light wings did I o'er-perch these walls;
## For stony limits cannot hold love out,
## And what love can do that dares love attempt;
## Therefore thy kinsmen are no let to me.
##
## JULIET
##
## If they do see thee, they will murder thee.
##
## ROMEO
##
## Alack, there lies more peril in thine eye
## Than twenty of their swords: look thou but sweet,
## And I am proof against their enmity.
##
## JULIET
##
## I would not for the world they saw thee here.
##
## ROMEO
##
## I have night's cloak to hide me from their sight;
## And but thou love me, let them find me here:
## My life were better ended by their hate,
## Than death prorogued, wanting of thy love.
##
## JULIET
##
## By whose direction found'st thou out this place?
##
## ROMEO
##
## By love, who first did prompt me to inquire;
## He lent me counsel and I lent him eyes.
## I am no pilot; yet, wert thou as far
## As that vast shore wash'd with the farthest sea,
## I would adventure for such merchandise.
##
## JULIET
##
## Thou know'st the mask of night is on my face,
## Else would a maiden blush bepaint my cheek
## For that which thou hast heard me speak to-night
## Fain would I dwell on form, fain, fain deny
## What I have spoke: but farewell compliment!
## Dost thou love me? I know thou wilt say 'Ay,'
## And I will take thy word: yet if thou swear'st,
## Thou mayst prove false; at lovers' perjuries
## Then say, Jove laughs. O gentle Romeo,
## If thou dost love, pronounce it faithfully:
## Or if thou think'st I am too quickly won,
## I'll frown and be perverse an say thee nay,
## So thou wilt woo; but else, not for the world.
## In truth, fair Montague, I am too fond,
## And therefore thou mayst think my 'havior light:
## But trust me, gentleman, I'll prove more true
## Than those that have more cunning to be strange.
## I should have been more strange, I must confess,
## But that thou overheard'st, ere I was ware,
## My true love's passion: therefore pardon me,
## And not impute this yielding to light love,
## Which the dark night hath so discovered.
##
## ROMEO
##
## Lady, by yonder blessed moon I swear
## That tips with silver all these fruit-tree tops--
##
## JULIET
##
## O, swear not by the moon, the inconstant moon,
## That monthly changes in her circled orb,
## Lest that thy love prove likewise variable.
##
## ROMEO
##
## What shall I swear by?
##
## JULIET
##
## Do not swear at all;
## Or, if thou wilt, swear by thy gracious self,
## Which is the god of my idolatry,
## And I'll believe thee.
##
## ROMEO
##
## If my heart's dear love--
##
## JULIET
##
## Well, do not swear: although I joy in thee,
## I have no joy of this contract to-night:
## It is too rash, too unadvised, too sudden;
## Too like the lightning, which doth cease to be
## Ere one can say 'It lightens.' Sweet, good night!
## This bud of love, by summer's ripening breath,
## May prove a beauteous flower when next we meet.
## Good night, good night! as sweet repose and rest
## Come to thy heart as that within my breast!
##
## ROMEO
##
## O, wilt thou leave me so unsatisfied?
##
## JULIET
##
## What satisfaction canst thou have to-night?
##
## ROMEO
##
## The exchange of thy love's faithful vow for mine.
##
## JULIET
##
## I gave thee mine before thou didst request it:
## And yet I would it were to give again.
##
## ROMEO
##
## Wouldst thou withdraw it? for what purpose, love?
##
## JULIET
##
## But to be frank, and give it thee again.
## And yet I wish but for the thing I have:
## My bounty is as boundless as the sea,
## My love as deep; the more I give to thee,
## The more I have, for both are infinite.
##
## Nurse calls within
##
## I hear some noise within; dear love, adieu!
## Anon, good nurse! Sweet Montague, be true.
## Stay but a little, I will come again.
## Exit, above
##
## ROMEO
##
## O blessed, blessed night! I am afeard.
## Being in night, all this is but a dream,
## Too flattering-sweet to be substantial.
##
## Re-enter JULIET, above
##
## JULIET
##
## Three words, dear Romeo, and good night indeed.
## If that thy bent of love be honourable,
## Thy purpose marriage, send me word to-morrow,
## By one that I'll procure to come to thee,
## Where and what time thou wilt perform the rite;
## And all my fortunes at thy foot I'll lay
## And follow thee my lord throughout the world.
##
## Nurse
##
## [Within] Madam!
##
## JULIET
##
## I come, anon.--But if thou mean'st not well,
## I do beseech thee--
##
## Nurse
## [Within] Madam!
##
## JULIET
##
## By and by, I come:--
## To cease thy suit, and leave me to my grief:
## To-morrow will I send.
##
## ROMEO
##
## So thrive my soul--
##
## JULIET
##
## A thousand times good night!
## Exit, above
##
## ROMEO
##
## A thousand times the worse, to want thy light.
## Love goes toward love, as schoolboys from
## their books,
## But love from love, toward school with heavy looks.
## Retiring
##
## Re-enter JULIET, above
##
## JULIET
##
## Hist! Romeo, hist! O, for a falconer's voice,
## To lure this tassel-gentle back again!
## Bondage is hoarse, and may not speak aloud;
## Else would I tear the cave where Echo lies,
## And make her airy tongue more hoarse than mine,
## With repetition of my Romeo's name.
##
## ROMEO
##
## It is my soul that calls upon my name:
## How silver-sweet sound lovers' tongues by night,
## Like softest music to attending ears!
##
## JULIET
##
## Romeo!
##
## ROMEO
##
## My dear?
##
## JULIET
##
## At what o'clock to-morrow
## Shall I send to thee?
##
## ROMEO
##
## At the hour of nine.
##
## JULIET
##
## I will not fail: 'tis twenty years till then.
## I have forgot why I did call thee back.
##
## ROMEO
##
## Let me stand here till thou remember it.
##
## JULIET
##
## I shall forget, to have thee still stand there,
## Remembering how I love thy company.
##
## ROMEO
##
## And I'll still stay, to have thee still forget,
## Forgetting any other home but this.
##
## JULIET
##
## 'Tis almost morning; I would have thee gone:
## And yet no further than a wanton's bird;
## Who lets it hop a little from her hand,
## Like a poor prisoner in his twisted gyves,
## And with a silk thread plucks it back again,
## So loving-jealous of his liberty.
##
## ROMEO
##
## I would I were thy bird.
##
## JULIET
##
## Sweet, so would I:
## Yet I should kill thee with much cherishing.
## Good night, good night! parting is such
## sweet sorrow,
## That I shall say good night till it be morrow.
##
## Exit above
##
## ROMEO
##
## Sleep dwell upon thine eyes, peace in thy breast!
## Would I were sleep and peace, so sweet to rest!
## Hence will I to my ghostly father's cell,
## His help to crave, and my dear hap to tell.
##
## Exit
##
## False
## True
## Romeo and Juliet
##
## Act 2, Scene 2
##
##
##
## <class 'numpy.ndarray'>
## ["b'Time'" "b'Percent'"]
## [ 0. 0.357]
## Unnamed: 0 PassengerId Survived Pclass \
## 0 1 1 0 3
## 1 2 2 1 1
## 2 3 3 1 3
## 3 4 4 1 1
## 4 5 5 0 3
##
## Name Sex Age SibSp \
## 0 Braund, Mr. Owen Harris male 22.0 1
## 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
## 2 Heikkinen, Miss. Laina female 26.0 0
## 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
## 4 Allen, Mr. William Henry male 35.0 0
##
## Parch Ticket Fare Cabin Embarked
## 0 0 A/5 21171 7.2500 NaN S
## 1 0 PC 17599 71.2833 C85 C
## 2 0 STON/O2. 3101282 7.9250 NaN S
## 3 0 113803 53.1000 C123 S
## 4 0 373450 8.0500 NaN S
## <class 'numpy.ndarray'>
Example Image Recognition Digit:
Sea Slug Data:
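The two file-handling patterns used in this chapter (an explicit `close()` versus a `with` block) can be compared on a throwaway file. This is a minimal sketch: the file name `example.txt` and its three lines are hypothetical, and a temporary directory is used so that nothing is left behind.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")
    with open(path, mode="w") as handle:
        handle.write("line one\nline two\nline three\n")

    # Pattern 1: open, read, and close by hand.
    handle = open(path, mode="r")
    first_line = handle.readline()
    closed_before = handle.closed
    handle.close()
    closed_after = handle.closed

    # Pattern 2: the with block closes the file on exit.
    with open(path) as handle:
        lines = [handle.readline() for _ in range(2)]
    closed_by_with = handle.closed

print(closed_before, closed_after, closed_by_with)
print(repr(first_line))
```

`readline()` keeps the trailing newline and `print()` adds another, which is why the three-line printout above shows a blank line after each line of text.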
Chapter 2 - Importing data from other file types
Introduction to other file types - Excel spreadsheets, MATLAB, SAS, Stata, HDF5 (becoming a more relevant format for saving data):
Importing SAS/Stata files using pandas:
Importing HDF5 (Hierarchical Data Format 5) files, quickly becoming the Python standard for storing large quantities of numerical data:
Importing MATLAB (MATrix LABoratory) files - industry standard in engineering and science:
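Python's own pickle format is another non-flat file type, and it is the first one used in the code below. This is a minimal round-trip sketch, assuming the same small dictionary as in these notes; it writes to a temporary directory rather than to `./PythonInputFiles/`, and the name `restored` is hypothetical.

```python
import os
import pickle
import tempfile

my_dict = {'Mar': '84.4', 'June': '69.4', 'Airline': '8', 'Aug': '85'}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.pkl")
    # Pickle files are binary, so write with "wb" and read with "rb".
    with open(path, "wb") as file:
        pickle.dump(my_dict, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == my_dict)
print(type(restored))
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.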
Example code includes:
myPath = "./PythonInputFiles/"
# Import pickle package
import pickle
# NEED PICKLE DATA - {'Mar': '84.4', 'June': '69.4', 'Airline': '8', 'Aug': '85'}
# Created using with open(myPath + "data.pkl", "wb") as file: pickle.dump(myDict, file)
# Open pickle file and load data: d
with open(myPath + 'data.pkl', mode="rb") as file:
    d = pickle.load(file)
# Print d
print(d)
# Print datatype of d
print(type(d))
# NEED BATTLE DEATHS DATA - https://www.prio.org/Data/Armed-Conflict/Battle-Deaths/The-Battle-Deaths-Dataset-version-30/ (downloaded and converted name to "battledeath.xlsx")
# Import pandas
import pandas as pd
# Assign spreadsheet filename: file
file = myPath + "battledeath.xlsx"
# Load spreadsheet: xl
xl = pd.ExcelFile(file)
# Print sheet names
print(xl.sheet_names)
# Load a sheet into a DataFrame by name: df1
# There is only one sheet absent converting "bdonly" to a file by year
df1 = xl.parse("bdonly")
# Print the head of the DataFrame df1
print(df1.head())
# Load a sheet into a DataFrame by index: df2
df2 = xl.parse(0)
# Print the head of the DataFrame df2
print(df2.head())
# Parse the first sheet and rename the columns: df1
df1 = xl.parse(0, skiprows=[0], parse_cols=[2, 9], names=["AAM due to War (2002)", "Country"])
# Print the head of the DataFrame df1
print(df1.head())
# Parse the tenth column of the first sheet and rename the column: df2
df2 = xl.parse(0, parse_cols=[9], skiprows=[0], names=["Country"])
# Print the head of the DataFrame df2
print(df2.head())
# DO NOT HAVE THIS FILE EITHER
# Import sas7bdat package
from sas7bdat import SAS7BDAT
# Save file to a DataFrame: df_sas
# with SAS7BDAT('sales.sas7bdat') as file:
#     df_sas = file.to_data_frame()
# Print head of DataFrame
# print(df_sas.head())
import matplotlib.pyplot as plt
# Plot histogram of DataFrame features (pandas and pyplot already imported)
# pd.DataFrame.hist(df_sas[['P']])
# plt.ylabel('count')
# plt.show()
# DO NOT HAVE THIS FILE EITHER
# Import pandas
# Load Stata file into a pandas DataFrame: df
# df = pd.read_stata("disarea.dta")
# Print the head of the DataFrame df
# print(df.head())
# Plot histogram of one column of the DataFrame
# pd.DataFrame.hist(df[['disa10']])
# plt.xlabel('Extent of disease')
# plt.ylabel('Number of countries')
# plt.show()
# DO NOT HAVE THIS FILE EITHER
# Import packages
import numpy as np
import h5py
# Assign filename: file
# file = 'LIGO_data.hdf5'
# Load file: data
# data = h5py.File(file, "r")
# Print the datatype of the loaded file
# print(type(data))
# Print the keys of the file
# for key in data.keys():
#     print(key)
# Get the HDF5 group: group
# group = data["strain"]
# Check out keys of group
# for key in group.keys():
#     print(key)
# Set variable equal to time series data: strain
# strain = data['strain']['Strain'][()]
# Set number of time points to sample: num_samples
# num_samples = 10000
# Set time vector
# time = np.arange(0, 1, 1/num_samples)
# Plot data
# plt.plot(time, strain[:num_samples])
# plt.xlabel('GPS Time (s)')
# plt.ylabel('strain')
# plt.show()
# DO NOT HAVE THIS FILE EITHER - see https://www.mcb.ucdavis.edu/faculty-labs/albeck/workshop.htm
# Import package (cannot get to download)
# import scipy.io
# Load MATLAB file: mat
# mat = scipy.io.loadmat('albeck_gene_expression.mat')
# Print the datatype of mat
# print(type(mat))
# Print the keys of the MATLAB dictionary
# print(mat.keys())
# Print the type of the value corresponding to the key 'CYratioCyt'
# print(type(mat["CYratioCyt"]))
# Print the shape of the value corresponding to the key 'CYratioCyt'
# print(np.shape(mat["CYratioCyt"]))
# Subset the array and plot it
# data = mat['CYratioCyt'][25, 5:]
# fig = plt.figure()
# plt.plot(data)
# plt.xlabel('time (min.)')
# plt.ylabel('normalized fluorescence (measure of expression)')
# plt.show()
## {'Mar': '84.4', 'June': '69.4', 'Airline': '8', 'Aug': '85'}
## <class 'dict'>
## ['bdonly']
## id year bdeadlow bdeadhig bdeadbes annualdata source bdversion \
## 0 1 1946 1000 9999 1000 2 1 3
## 1 1 1952 450 3000 -999 2 1 3
## 2 1 1967 25 999 82 2 1 3
## 3 2 1946 25 999 -999 0 0 3
## 4 2 1947 25 999 -999 0 0 3
##
## location sidea ... epend ependdate ependprec gwnoa gwnoa2nd \
## 0 Bolivia Bolivia ... 1 1946-07-21 -99.0 145 NaN
## 1 Bolivia Bolivia ... 1 1952-04-12 -99.0 145 NaN
## 2 Bolivia Bolivia ... 1 1967-10-16 -99.0 145 NaN
## 3 Cambodia France ... 0 NaT NaN 220 NaN
## 4 Cambodia France ... 0 NaT NaN 220 NaN
##
## gwnob gwnob2nd gwnoloc region version
## 0 NaN NaN 145 5 2009-4
## 1 NaN NaN 145 5 2009-4
## 2 NaN NaN 145 5 2009-4
## 3 NaN NaN 811 3 2009-4
## 4 NaN NaN 811 3 2009-4
##
## [5 rows x 32 columns]
## id year bdeadlow bdeadhig bdeadbes annualdata source bdversion \
## 0 1 1946 1000 9999 1000 2 1 3
## 1 1 1952 450 3000 -999 2 1 3
## 2 1 1967 25 999 82 2 1 3
## 3 2 1946 25 999 -999 0 0 3
## 4 2 1947 25 999 -999 0 0 3
##
## location sidea ... epend ependdate ependprec gwnoa gwnoa2nd \
## 0 Bolivia Bolivia ... 1 1946-07-21 -99.0 145 NaN
## 1 Bolivia Bolivia ... 1 1952-04-12 -99.0 145 NaN
## 2 Bolivia Bolivia ... 1 1967-10-16 -99.0 145 NaN
## 3 Cambodia France ... 0 NaT NaN 220 NaN
## 4 Cambodia France ... 0 NaT NaN 220 NaN
##
## gwnob gwnob2nd gwnoloc region version
## 0 NaN NaN 145 5 2009-4
## 1 NaN NaN 145 5 2009-4
## 2 NaN NaN 145 5 2009-4
## 3 NaN NaN 811 3 2009-4
## 4 NaN NaN 811 3 2009-4
##
## [5 rows x 32 columns]
## AAM due to War (2002) Country
## 0 450 Bolivia
## 1 25 Bolivia
## 2 25 France
## 3 25 France
## 4 25 France
## Country
## 0 Bolivia
## 1 Bolivia
## 2 France
## 3 France
## 4 France
Chapter 3 - Relational databases
Introduction to relational databases - standard discussion of how a relational database (system of tables) works:
Creating a database engine in Python - goal is to get data out of the relational database using SQL:
Querying relational databases in Python - connecting to the engine and then querying (getting data out from) the database:
Querying relational databases directly with pandas - shortcut to the above process:
Advanced querying - exploiting table relationships (combining multiple tables):
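The connect, query, fetch, and label-the-columns workflow described above can be sketched without the Chinook file. This is a minimal sketch only: it assumes the standard-library sqlite3 module in place of SQLAlchemy, and the Employee rows are made up for illustration (they are not Chinook data). Here cur.description plays roughly the role that rs.keys() plays in the course code.

```python
import sqlite3

# In-memory database with a hypothetical Employee table (illustrative rows only)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, LastName TEXT, Title TEXT)")
con.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(1, "Ames", "Manager"), (2, "Brook", "Analyst"), (3, "Cole", "Analyst"), (4, "Dunn", "Clerk")],
)

# Run a query; the cursor is the analogue of the SQLAlchemy result set
cur = con.execute("SELECT LastName, Title FROM Employee ORDER BY EmployeeId")

# Column names come from the cursor description
columns = [d[0] for d in cur.description]

# fetchmany(n) returns at most n rows; fetchall() returns whatever remains
first_rows = cur.fetchmany(3)
remaining_rows = cur.fetchall()

# Close the connection once the rows have been fetched
con.close()

print(columns)
print(len(first_rows), len(remaining_rows))
```

The rows and column names obtained this way could then be passed to pd.DataFrame(first_rows, columns=columns), which mirrors the two-step pattern used in the course exercises.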
Example code includes:
myPath = "./PythonInputFiles/"
# NEED FILE - may be able to get at http://chinookdatabase.codeplex.com/
# Downloaded the ZIP, extracted the SQLite, and renamed to Chinook.sqlite
# Import necessary module
from sqlalchemy import create_engine, inspect
# Create engine: engine
engine = create_engine('sqlite:///' + myPath + 'Chinook.sqlite') # 'sqlite:///' followed by the file path forms the 'connection string'
# Save the table names to a list: table_names
# (engine.table_names() does the same in older SQLAlchemy releases)
table_names = inspect(engine).get_table_names()
# Print the table names to the shell
print(table_names)
# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Create engine: engine
engine = create_engine('sqlite:///' + myPath + 'Chinook.sqlite')
# Open engine connection: con
con = engine.connect()
# Perform query: rs
rs = con.execute("SELECT * FROM Album")
# Save results of the query to DataFrame: df
df = pd.DataFrame(rs.fetchall())
# Close connection
con.close()
# Print head of DataFrame df
print(df.head())
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT LastName, Title FROM Employee")
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()
# Print the length of the DataFrame df
print(len(df))
# Print the head of the DataFrame df
print(df.head())
# Create engine: engine
engine = create_engine('sqlite:///' + myPath + 'Chinook.sqlite')
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee WHERE EmployeeID >= 6")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
# Print the head of the DataFrame df
print(df.head())
# Create engine: engine
engine = create_engine('sqlite:///' + myPath + 'Chinook.sqlite')
# Open engine in context manager
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee ORDER BY BirthDate")
    df = pd.DataFrame(rs.fetchall())
    # Set the DataFrame's column names
    df.columns = rs.keys()
# Print head of DataFrame
print(df.head())
# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Create engine: engine
engine = create_engine('sqlite:///' + myPath + 'Chinook.sqlite')
# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Album", engine)
# Print head of DataFrame
print(df.head())
# Open engine in context manager
# Perform query and save results to DataFrame: df1
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Album")
    df1 = pd.DataFrame(rs.fetchall())
    df1.columns = rs.keys()
# Confirm that both methods yield the same result: does df = df1 ?
print(df.equals(df1))
# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Create engine: engine
engine = create_engine('sqlite:///' + myPath + 'Chinook.sqlite')
# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Employee WHERE EmployeeId >= 6 ORDER BY BirthDate", engine)
# Print head of DataFrame
print(df.head())
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT Title, Name FROM Album INNER JOIN Artist ON Album.ArtistID = Artist.ArtistID")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
# Print head of DataFrame df
print(df.head())
# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM PlaylistTrack INNER JOIN Track ON PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000", engine)
# Print head of DataFrame
print(df.head())
## ['Album', 'Artist', 'Customer', 'Employee', 'Genre', 'Invoice', 'InvoiceLine', 'MediaType', 'Playlist', 'PlaylistTrack', 'Track']
## 0 1 2
## 0 1 For Those About To Rock We Salute You 1
## 1 2 Balls to the Wall 2
## 2 3 Restless and Wild 2
## 3 4 Let There Be Rock 1
## 4 5 Big Ones 3
## 3
## LastName Title
## 0 Adams General Manager
## 1 Edwards Sales Manager
## 2 Peacock Sales Support Agent
## EmployeeId LastName FirstName Title ReportsTo BirthDate \
## 0 6 Mitchell Michael IT Manager 1 1973-07-01 00:00:00
## 1 7 King Robert IT Staff 6 1970-05-29 00:00:00
## 2 8 Callahan Laura IT Staff 6 1968-01-09 00:00:00
##
## HireDate Address City State Country \
## 0 2003-10-17 00:00:00 5827 Bowness Road NW Calgary AB Canada
## 1 2004-01-02 00:00:00 590 Columbia Boulevard West Lethbridge AB Canada
## 2 2004-03-04 00:00:00 923 7 ST NW Lethbridge AB Canada
##
## PostalCode Phone Fax Email
## 0 T3B 0C5 +1 (403) 246-9887 +1 (403) 246-9899 michael@chinookcorp.com
## 1 T1K 5N8 +1 (403) 456-9986 +1 (403) 456-8485 robert@chinookcorp.com
## 2 T1H 1Y8 +1 (403) 467-3351 +1 (403) 467-8772 laura@chinookcorp.com
## EmployeeId LastName FirstName Title ReportsTo \
## 0 4 Park Margaret Sales Support Agent 2.0
## 1 2 Edwards Nancy Sales Manager 1.0
## 2 1 Adams Andrew General Manager NaN
## 3 5 Johnson Steve Sales Support Agent 2.0
## 4 8 Callahan Laura IT Staff 6.0
##
## BirthDate HireDate Address City \
## 0 1947-09-19 00:00:00 2003-05-03 00:00:00 683 10 Street SW Calgary
## 1 1958-12-08 00:00:00 2002-05-01 00:00:00 825 8 Ave SW Calgary
## 2 1962-02-18 00:00:00 2002-08-14 00:00:00 11120 Jasper Ave NW Edmonton
## 3 1965-03-03 00:00:00 2003-10-17 00:00:00 7727B 41 Ave Calgary
## 4 1968-01-09 00:00:00 2004-03-04 00:00:00 923 7 ST NW Lethbridge
##
## State Country PostalCode Phone Fax \
## 0 AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289
## 1 AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322
## 2 AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457
## 3 AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543
## 4 AB Canada T1H 1Y8 +1 (403) 467-3351 +1 (403) 467-8772
##
## Email
## 0 margaret@chinookcorp.com
## 1 nancy@chinookcorp.com
## 2 andrew@chinookcorp.com
## 3 steve@chinookcorp.com
## 4 laura@chinookcorp.com
## AlbumId Title ArtistId
## 0 1 For Those About To Rock We Salute You 1
## 1 2 Balls to the Wall 2
## 2 3 Restless and Wild 2
## 3 4 Let There Be Rock 1
## 4 5 Big Ones 3
## True
## EmployeeId LastName FirstName Title ReportsTo BirthDate \
## 0 8 Callahan Laura IT Staff 6 1968-01-09 00:00:00
## 1 7 King Robert IT Staff 6 1970-05-29 00:00:00
## 2 6 Mitchell Michael IT Manager 1 1973-07-01 00:00:00
##
## HireDate Address City State Country \
## 0 2004-03-04 00:00:00 923 7 ST NW Lethbridge AB Canada
## 1 2004-01-02 00:00:00 590 Columbia Boulevard West Lethbridge AB Canada
## 2 2003-10-17 00:00:00 5827 Bowness Road NW Calgary AB Canada
##
## PostalCode Phone Fax Email
## 0 T1H 1Y8 +1 (403) 467-3351 +1 (403) 467-8772 laura@chinookcorp.com
## 1 T1K 5N8 +1 (403) 456-9986 +1 (403) 456-8485 robert@chinookcorp.com
## 2 T3B 0C5 +1 (403) 246-9887 +1 (403) 246-9899 michael@chinookcorp.com
## Title Name
## 0 For Those About To Rock We Salute You AC/DC
## 1 Balls to the Wall Accept
## 2 Restless and Wild Accept
## 3 Let There Be Rock AC/DC
## 4 Big Ones Aerosmith
## PlaylistId TrackId TrackId Name AlbumId MediaTypeId \
## 0 1 3390 3390 One and the Same 271 2
## 1 1 3392 3392 Until We Fall 271 2
## 2 1 3393 3393 Original Fire 271 2
## 3 1 3394 3394 Broken City 271 2
## 4 1 3395 3395 Somedays 271 2
##
## GenreId Composer Milliseconds Bytes UnitPrice
## 0 23 None 217732 3559040 0.99
## 1 23 None 230758 3766605 0.99
## 2 23 None 218916 3577821 0.99
## 3 23 None 228366 3728955 0.99
## 4 23 None 213831 3497176 0.99
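The INNER JOIN queries above keep only rows whose key appears in both tables. A minimal sketch of that behaviour follows, assuming the standard-library sqlite3 module and two tiny made-up tables that borrow the Album and Artist column names (the rows themselves are hypothetical, not Chinook data).

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical tables: one album points at a missing artist, one artist has no album
con.executescript("""
    CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
    INSERT INTO Artist VALUES (1, 'Artist A'), (2, 'Artist B'), (3, 'Artist C');
    INSERT INTO Album VALUES (10, 'First', 1), (11, 'Second', 2), (12, 'Third', 1), (13, 'Orphan', 99);
""")

# Match each album to its artist on the shared ArtistId column
query = """
    SELECT Title, Name
    FROM Album
    INNER JOIN Artist ON Album.ArtistId = Artist.ArtistId
    ORDER BY Album.AlbumId
"""
joined = con.execute(query).fetchall()
con.close()

for title, name in joined:
    print(title, "-", name)
```

In this sketch the album with ArtistId 99 and the artist with no albums both drop out of the result, which is the defining property of an inner join.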
Chapter 1 - Importing Data from the Internet
Importing flat files from the web - non-local files:
HTTP requests to import files from the web - unpacking what urlretrieve from urllib.request does (it sends an HTTP GET request):
Scraping the web in Python using BeautifulSoup - make sense of the jumbled, unstructured HTML data:
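The two urllib.request approaches described above (saving a copy with urlretrieve versus reading the response directly with urlopen) can be tried offline. This is a sketch under stated assumptions: a temporary local file exposed through a file:// URL stands in for the remote CSV, and its contents are made up. An http(s) URL would be passed to the same functions in the same way.

```python
import csv
import tempfile
from pathlib import Path
from urllib.request import urlopen, urlretrieve

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a remote file: a small semicolon-separated CSV (made-up values)
    source = Path(tmp) / "remote.csv"
    source.write_text("fixed acidity;pH\n7.4;3.51\n7.8;3.20\n", encoding="utf-8")
    url = source.as_uri()  # file:// URL used in place of an http(s) URL

    # Option 1: save a local copy, then read it like any other flat file
    local_copy = Path(tmp) / "local.csv"
    urlretrieve(url, str(local_copy))
    with open(local_copy, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter=";"))

    # Option 2: open the URL and read the raw bytes without saving a copy
    with urlopen(url) as response:
        raw = response.read()

print(rows[0])
print(raw[:16])
```

Using the response as a context manager closes it automatically, which covers the "be polite and close the response" step in the exercises.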
Example code includes:
# Import package
from urllib.request import urlretrieve
import pandas as pd
# Assign url of file: url (ran once - no need to re-run)
# url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
# Save file locally
# urlretrieve(url, 'winequality-red.csv')
# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())
# Import packages
import matplotlib.pyplot as plt
import pandas as pd
# Assign url of file: url (ran once - no need to re-run)
# url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
# Read file into a DataFrame: df
# df = pd.read_csv(url, sep=";")
# Print the head of the DataFrame
# print(df.head())
# Plot first column of df
pd.DataFrame.hist(df.iloc[:, 0:1])
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
# plt.show()
plt.savefig("_dummyPy044.png", bbox_inches="tight")
plt.clf()
# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'
# Read in all sheets of Excel file: xl
xl = pd.read_excel(url, sheet_name=None)
# Print the sheetnames to the shell
print(xl.keys())
# Print the head of the first sheet (using its name, NOT its index)
print(xl["1700"].head())
# Import packages
from urllib.request import urlopen, Request
# Specify the url
url = "http://www.datacamp.com/teach/documentation"
# This packages the request: request
request = Request(url)
# Sends the request and catches the response: response
response = urlopen(request)
# Print the datatype of response
print(type(response))
# Be polite and close the response!
response.close()
# Specify the url
url = "http://docs.datacamp.com/teach/"
# This packages the request
request = Request(url)
# Sends the request and catches the response: response
response = urlopen(request)
# Extract the response: html
html = response.read()
# Print the html
print(html)
# Be polite and close the response!
response.close()
import requests
# Specify the url: url
url = "http://docs.datacamp.com/teach/"
# Packages the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response: text
text = r.text
# Print the html
print(text)
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url: url
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extracts the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)
# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()
# Print the response
print(pretty_soup)
# Get the title of Guido's webpage: guido_title
guido_title = soup.title
# Print the title of Guido's webpage to the shell
print(guido_title)
# Get Guido's text: guido_text
guido_text = soup.get_text()
# Print Guido's text to the shell
print(guido_text)
# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all("a")
# Print the URLs to the shell
for link in a_tags:
    print(link.get("href"))
## fixed acidity volatile acidity citric acid residual sugar chlorides \
## 0 7.4 0.70 0.00 1.9 0.076
## 1 7.8 0.88 0.00 2.6 0.098
## 2 7.8 0.76 0.04 2.3 0.092
## 3 11.2 0.28 0.56 1.9 0.075
## 4 7.4 0.70 0.00 1.9 0.076
##
## free sulfur dioxide total sulfur dioxide density pH sulphates \
## 0 11.0 34.0 0.9978 3.51 0.56
## 1 25.0 67.0 0.9968 3.20 0.68
## 2 15.0 54.0 0.9970 3.26 0.65
## 3 17.0 60.0 0.9980 3.16 0.58
## 4 11.0 34.0 0.9978 3.51 0.56
##
## alcohol quality
## 0 9.4 5
## 1 9.8 5
## 2 9.8 5
## 3 9.8 6
## 4 9.4 5
## odict_keys(['1700', '1900'])
## country 1700
## 0 Afghanistan 34.565000
## 1 Akrotiri and Dhekelia 34.616667
## 2 Albania 41.312000
## 3 Algeria 36.720000
## 4 American Samoa -14.307000
## <class 'http.client.HTTPResponse'>
## b'<!DOCTYPE html>\n<link rel="shortcut icon" href="images/favicon.ico" />\n<html>\n\n <head>\n <meta charset="utf-8">\n <meta http-equiv="X-UA-Compatible" content="IE=edge">\n <meta name="viewport" content="width=device-width, initial-scale=1">\n\n <title>Home</title>\n <meta name="description" content="All Documentation on Course Creation">\n\n <link rel="stylesheet" href="/teach/css/main.css">\n <link rel="canonical" href="/teach/">\n <link rel="alternate" type="application/rss+xml" title="DataCamp Teach Documentation" href="/teach/feed.xml" />\n</head>\n\n\n <body>\n\n <header class="site-header">\n\n <div class="wrapper">\n\n <a class="site-title" href="/teach/">DataCamp Teach Documentation</a>\n\n </div>\n\n</header>\n\n\n <div class="page-content">\n <div class="wrapper">\n <p>The Teach Documentation has been moved to <a href="https://www.datacamp.com/teach/documentation">https://www.datacamp.com/teach/documentation</a>!</p>\n\n<!-- Everybody can teach on DataCamp. The resources on this website explain all the steps to build your own course on DataCamp\'s interactive data science platform.\n\nInterested in partnering with DataCamp? 
Head over to the [Course Material](/teach/course-material.html) page to get an idea of the requirements to build your own interactive course together with DataCamp!\n\n## Table of Contents\n\n- [Course Material](/teach/course-material.html) - Content required to build a DataCamp course.\n- [Video Lectures](/teach/video-lectures.html) - Details on video recording and editing.\n- [DataCamp Teach](https://www.datacamp.com/teach) - Use the DataCamp Teach website to create DataCamp courses (preferred).\n- [datacamp R Package](https://github.com/datacamp/datacamp/wiki) - Use R Package to create DataCamp courses (legacy).\n- [Code DataCamp Exercises](/teach/code-datacamp-exercises.html)\n- [SCT Design (R)](https://github.com/datacamp/testwhat/wiki)\n- [SCT Design (Python)](https://github.com/datacamp/pythonwhat/wiki)\n- [Style Guide](/teach/style-guide.html) -->\n\n\n </div>\n </div>\n\n \n\n </body>\n\n</html>\n'
## <!DOCTYPE html>
## <link rel="shortcut icon" href="images/favicon.ico" />
## <html>
##
## <head>
## <meta charset="utf-8">
## <meta http-equiv="X-UA-Compatible" content="IE=edge">
## <meta name="viewport" content="width=device-width, initial-scale=1">
##
## <title>Home</title>
## <meta name="description" content="All Documentation on Course Creation">
##
## <link rel="stylesheet" href="/teach/css/main.css">
## <link rel="canonical" href="/teach/">
## <link rel="alternate" type="application/rss+xml" title="DataCamp Teach Documentation" href="/teach/feed.xml" />
## </head>
##
##
## <body>
##
## <header class="site-header">
##
## <div class="wrapper">
##
## <a class="site-title" href="/teach/">DataCamp Teach Documentation</a>
##
## </div>
##
## </header>
##
##
## <div class="page-content">
## <div class="wrapper">
## <p>The Teach Documentation has been moved to <a href="https://www.datacamp.com/teach/documentation">https://www.datacamp.com/teach/documentation</a>!</p>
##
## <!-- Everybody can teach on DataCamp. The resources on this website explain all the steps to build your own course on DataCamp's interactive data science platform.
##
## Interested in partnering with DataCamp? Head over to the [Course Material](/teach/course-material.html) page to get an idea of the requirements to build your own interactive course together with DataCamp!
##
## ## Table of Contents
##
## - [Course Material](/teach/course-material.html) - Content required to build a DataCamp course.
## - [Video Lectures](/teach/video-lectures.html) - Details on video recording and editing.
## - [DataCamp Teach](https://www.datacamp.com/teach) - Use the DataCamp Teach website to create DataCamp courses (preferred).
## - [datacamp R Package](https://github.com/datacamp/datacamp/wiki) - Use R Package to create DataCamp courses (legacy).
## - [Code DataCamp Exercises](/teach/code-datacamp-exercises.html)
## - [SCT Design (R)](https://github.com/datacamp/testwhat/wiki)
## - [SCT Design (Python)](https://github.com/datacamp/pythonwhat/wiki)
## - [Style Guide](/teach/style-guide.html) -->
##
##
## </div>
## </div>
##
##
##
## </body>
##
## </html>
##
## <html>
## <head>
## <title>
## Guido's Personal Home Page
## </title>
## </head>
## <body bgcolor="#FFFFFF" text="#000000">
## <h1>
## <a href="pics.html">
## <img border="0" src="images/IMG_2192.jpg"/>
## </a>
## Guido van Rossum - Personal Home Page
## </h1>
## <p>
## <a href="http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm">
## <i>
## "Gawky and proud of it."
## </i>
## </a>
## <h3>
## <a href="http://metalab.unc.edu/Dave/Dr-Fun/df200004/df20000406.jpg">
## Who
## I Am
## </a>
## </h3>
## <p>
## Read
## my
## <a href="http://neopythonic.blogspot.com/2016/04/kings-day-speech.html">
## "King's
## Day Speech"
## </a>
## for some inspiration.
## <p>
## I am the author of the
## <a href="http://www.python.org">
## Python
## </a>
## programming language. See also my
## <a href="Resume.html">
## resume
## </a>
## and my
## <a href="Publications.html">
## publications list
## </a>
## , a
## <a href="bio.html">
## brief bio
## </a>
## , assorted
## <a href="http://legacy.python.org/doc/essays/">
## writings
## </a>
## ,
## <a href="http://legacy.python.org/doc/essays/ppt/">
## presentations
## </a>
## and
## <a href="interviews.html">
## interviews
## </a>
## (all about Python), some
## <a href="pics.html">
## pictures of me
## </a>
## ,
## <a href="http://neopythonic.blogspot.com">
## my new blog
## </a>
## , and
## my
## <a href="http://www.artima.com/weblogs/index.jsp?blogger=12088">
## old
## blog
## </a>
## on Artima.com. I am
## <a href="https://twitter.com/gvanrossum">
## @gvanrossum
## </a>
## on Twitter. I
## also have
## a
## <a href="https://plus.google.com/u/0/115212051037621986145/posts">
## G+
## profile
## </a>
## .
## <p>
## In January 2013 I joined
## <a href="http://www.dropbox.com">
## Dropbox
## </a>
## . I work on various Dropbox
## products and have 50% for my Python work, no strings attached.
## Previously, I have worked for Google, Elemental Security, Zope
## Corporation, BeOpen.com, CNRI, CWI, and SARA. (See
## my
## <a href="Resume.html">
## resume
## </a>
## .) I created Python while at CWI.
## <h3>
## How to Reach Me
## </h3>
## <p>
## You can send email for me to guido (at) python.org.
## I read everything sent there, but if you ask
## me a question about using Python, it's likely that I won't have time
## to answer it, and will instead refer you to
## help (at) python.org,
## <a href="http://groups.google.com/groups?q=comp.lang.python">
## comp.lang.python
## </a>
## or
## <a href="http://stackoverflow.com">
## StackOverflow
## </a>
## . If you need to
## talk to me on the phone or send me something by snail mail, send me an
## email and I'll gladly email you instructions on how to reach me.
## <h3>
## My Name
## </h3>
## <p>
## My name often poses difficulties for Americans.
## <p>
## <b>
## Pronunciation:
## </b>
## in Dutch, the "G" in Guido is a hard G,
## pronounced roughly like the "ch" in Scottish "loch". (Listen to the
## <a href="guido.au">
## sound clip
## </a>
## .) However, if you're
## American, you may also pronounce it as the Italian "Guido". I'm not
## too worried about the associations with mob assassins that some people
## have. :-)
## <p>
## <b>
## Spelling:
## </b>
## my last name is two words, and I'd like to keep it
## that way, the spelling on some of my credit cards notwithstanding.
## Dutch spelling rules dictate that when used in combination with my
## first name, "van" is not capitalized: "Guido van Rossum". But when my
## last name is used alone to refer to me, it is capitalized, for
## example: "As usual, Van Rossum was right."
## <p>
## <b>
## Alphabetization:
## </b>
## in America, I show up in the alphabet under
## "V". But in Europe, I show up under "R". And some of my friends put
## me under "G" in their address book...
## <h3>
## More Hyperlinks
## </h3>
## <ul>
## <li>
## Here's a collection of
## <a href="http://legacy.python.org/doc/essays/">
## essays
## </a>
## relating to Python
## that I've written, including the foreword I wrote for Mark Lutz' book
## "Programming Python".
## <p>
## <li>
## I own the official
## <a href="images/license.jpg">
## <img align="center" border="0" height="75" src="images/license_thumb.jpg" width="100"/>
## Python license.
## </a>
## <p>
## </p>
## </li>
## </p>
## </li>
## </ul>
## <h3>
## The Audio File Formats FAQ
## </h3>
## <p>
## I was the original creator and maintainer of the Audio File Formats
## FAQ. It is now maintained by Chris Bagwell
## at
## <a href="http://www.cnpbagwell.com/audio-faq">
## http://www.cnpbagwell.com/audio-faq
## </a>
## . And here is a link to
## <a href="http://sox.sourceforge.net/">
## SOX
## </a>
## , to which I contributed
## some early code.
## </p>
## </p>
## </p>
## </p>
## </p>
## </p>
## </p>
## </p>
## </p>
## </p>
## </body>
## </html>
## <hr/>
## <a href="images/internetdog.gif">
## "On the Internet, nobody knows you're
## a dog."
## </a>
## <hr/>
## C:\Users\Dave\AppData\Local\Programs\Python\PYTHON~1\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
##
## The code that caused this warning is on line 119 of the file <string>. To get rid of this warning, change code that looks like this:
##
## BeautifulSoup(YOUR_MARKUP})
##
## to this:
##
## BeautifulSoup(YOUR_MARKUP, "html.parser")
##
## markup_type=markup_type))
##
## <title>Guido's Personal Home Page</title>
##
##
## Guido's Personal Home Page
##
##
##
##
## Guido van Rossum - Personal Home Page
## "Gawky and proud of it."
## Who
## I Am
## Read
## my "King's
## Day Speech" for some inspiration.
##
## I am the author of the Python
## programming language. See also my resume
## and my publications list, a brief bio, assorted writings, presentations and interviews (all about Python), some
## pictures of me,
## my new blog, and
## my old
## blog on Artima.com. I am
## @gvanrossum on Twitter. I
## also have
## a G+
## profile.
##
## In January 2013 I joined
## Dropbox. I work on various Dropbox
## products and have 50% for my Python work, no strings attached.
## Previously, I have worked for Google, Elemental Security, Zope
## Corporation, BeOpen.com, CNRI, CWI, and SARA. (See
## my resume.) I created Python while at CWI.
##
## How to Reach Me
## You can send email for me to guido (at) python.org.
## I read everything sent there, but if you ask
## me a question about using Python, it's likely that I won't have time
## to answer it, and will instead refer you to
## help (at) python.org,
## comp.lang.python or
## StackOverflow. If you need to
## talk to me on the phone or send me something by snail mail, send me an
## email and I'll gladly email you instructions on how to reach me.
##
## My Name
## My name often poses difficulties for Americans.
##
## Pronunciation: in Dutch, the "G" in Guido is a hard G,
## pronounced roughly like the "ch" in Scottish "loch". (Listen to the
## sound clip.) However, if you're
## American, you may also pronounce it as the Italian "Guido". I'm not
## too worried about the associations with mob assassins that some people
## have. :-)
##
## Spelling: my last name is two words, and I'd like to keep it
## that way, the spelling on some of my credit cards notwithstanding.
## Dutch spelling rules dictate that when used in combination with my
## first name, "van" is not capitalized: "Guido van Rossum". But when my
## last name is used alone to refer to me, it is capitalized, for
## example: "As usual, Van Rossum was right."
##
## Alphabetization: in America, I show up in the alphabet under
## "V". But in Europe, I show up under "R". And some of my friends put
## me under "G" in their address book...
##
##
## More Hyperlinks
##
## Here's a collection of essays relating to Python
## that I've written, including the foreword I wrote for Mark Lutz' book
## "Programming Python".
## I own the official
## Python license.
##
## The Audio File Formats FAQ
## I was the original creator and maintainer of the Audio File Formats
## FAQ. It is now maintained by Chris Bagwell
## at http://www.cnpbagwell.com/audio-faq. And here is a link to
## SOX, to which I contributed
## some early code.
##
##
##
## "On the Internet, nobody knows you're
## a dog."
##
##
##
## pics.html
## http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm
## http://metalab.unc.edu/Dave/Dr-Fun/df200004/df20000406.jpg
## http://neopythonic.blogspot.com/2016/04/kings-day-speech.html
## http://www.python.org
## Resume.html
## Publications.html
## bio.html
## http://legacy.python.org/doc/essays/
## http://legacy.python.org/doc/essays/ppt/
## interviews.html
## pics.html
## http://neopythonic.blogspot.com
## http://www.artima.com/weblogs/index.jsp?blogger=12088
## https://twitter.com/gvanrossum
## https://plus.google.com/u/0/115212051037621986145/posts
## http://www.dropbox.com
## Resume.html
## http://groups.google.com/groups?q=comp.lang.python
## http://stackoverflow.com
## guido.au
## http://legacy.python.org/doc/essays/
## images/license.jpg
## http://www.cnpbagwell.com/audio-faq
## http://sox.sourceforge.net/
## images/internetdog.gif
Acidity of Red Wine:
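The hyperlink extraction in the BeautifulSoup example (find_all("a") followed by link.get("href")) can be approximated with the standard library alone. This is a rough sketch, not a replacement for BeautifulSoup: LinkCollector is a hypothetical helper class, and the HTML string is a made-up stand-in for a downloaded page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; href may be absent
            self.links.append(dict(attrs).get("href"))


# Small made-up page standing in for downloaded HTML
html_doc = """
<html><head><title>Sample Page</title></head>
<body>
<p>See my <a href="Resume.html">resume</a> and
<a href="http://www.python.org">Python</a>.</p>
<a name="anchor-without-href">no link here</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(html_doc)
print(collector.links)
```

As with link.get("href") in BeautifulSoup, an <a> tag without an href yields None here rather than raising an error.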
Chapter 2 - Interacting with APIs
Introduction to APIs (Application Programming Interface) and JSON (JavaScript Object Notation):
APIs and interacting with the world-wide web - what APIs are and why they are important:
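JSON text maps directly onto Python dictionaries and lists, which is why the exercises below can loop over keys after loading. A minimal sketch, assuming a made-up JSON string in place of a file or API response (the field names loosely imitate a movie record but the values are illustrative):

```python
import json

# Made-up JSON text standing in for the contents of a file or an API response
json_text = '{"Title": "Example Film", "Year": "2010", "Ratings": [{"Source": "Site A", "Value": "7.7/10"}]}'

# json.loads parses a string; json.load does the same for an open file object
json_data = json.loads(json_text)

# Objects become dicts, so key-value pairs can be iterated directly
for key, value in json_data.items():
    print(key + ": ", value)

# Arrays become lists, so nested values are reached by chaining keys and indexes
first_rating = json_data["Ratings"][0]["Value"]
print(first_rating)
```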
Example code includes:
myPath = "./PythonInputFiles/"
# DO NOT HAVE FILE a_movie.json, which appears to be JSON for the movie Social Network (2010)
# Created and saved file
import json
# Load JSON: json_data
with open(myPath + "a_movie.json") as json_file:
    json_data = json.load(json_file)
# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
# PROBABLY DO NOT RUN; NEED API KEY
# Import requests package
import requests
# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=ff21610b&t=social+network'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Print the text of the response
print(r.text)
# Decode the JSON data into a dictionary: json_data
json_data = r.json()
# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
# Assign URL to variable: url
url = "https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza"
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Decode the JSON data into a dictionary: json_data
json_data = r.json()
# Print the Wikipedia page extract
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)
## imdbRating: 7.7
## Rated: PG-13
## Year: 2010
## DVD: N/A
## Ratings: [{'Value': '7.7/10', 'Source': 'Internet Movie Database'}, {'Value': '96%', 'Source': 'Rotten Tomatoes'}, {'Value': '95/100', 'Source': 'Metacritic'}]
## Metascore: 95
## Runtime: 120 min
## Released: 01 Oct 2010
## Plot: Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.
## Poster: https://images-na.ssl-images-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg
## imdbVotes: 508,540
## Director: David Fincher
## Website: http://www.thesocialnetwork-movie.com/
## Writer: Aaron Sorkin (screenplay), Ben Mezrich (book)
## Awards: Won 3 Oscars. Another 162 wins & 162 nominations.
## Language: English, French
## Genre: Biography, Drama
## Type: movie
## Country: USA
## Production: Columbia Pictures
## imdbID: tt1285016
## Actors: Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons
## BoxOffice: $96,400,000
## Response: True
## Title: The Social Network
## {"Title":"The Social Network","Year":"2010","Rated":"PG-13","Released":"01 Oct 2010","Runtime":"120 min","Genre":"Biography, Drama","Director":"David Fincher","Writer":"Aaron Sorkin (screenplay), Ben Mezrich (book)","Actors":"Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons","Plot":"Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.","Language":"English, French","Country":"USA","Awards":"Won 3 Oscars. Another 165 wins & 168 nominations.","Poster":"https://images-na.ssl-images-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg","Ratings":[{"Source":"Internet Movie Database","Value":"7.7/10"},{"Source":"Rotten Tomatoes","Value":"96%"},{"Source":"Metacritic","Value":"95/100"}],"Metascore":"95","imdbRating":"7.7","imdbVotes":"511,136","imdbID":"tt1285016","Type":"movie","DVD":"11 Jan 2011","BoxOffice":"$96,400,000","Production":"Columbia Pictures","Website":"http://www.thesocialnetwork-movie.com/","Response":"True"}
## Title: The Social Network
## Year: 2010
## Rated: PG-13
## Released: 01 Oct 2010
## Runtime: 120 min
## Genre: Biography, Drama
## Director: David Fincher
## Writer: Aaron Sorkin (screenplay), Ben Mezrich (book)
## Actors: Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons
## Plot: Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.
## Language: English, French
## Country: USA
## Awards: Won 3 Oscars. Another 165 wins & 168 nominations.
## Poster: https://images-na.ssl-images-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg
## Ratings: [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
## Metascore: 95
## imdbRating: 7.7
## imdbVotes: 511,136
## imdbID: tt1285016
## Type: movie
## DVD: 11 Jan 2011
## BoxOffice: $96,400,000
## Production: Columbia Pictures
## Website: http://www.thesocialnetwork-movie.com/
## Response: True
## <p><b>Pizza</b> is a yeasted flatbread typically topped with tomato sauce and cheese and baked in an oven. It is commonly topped with a selection of meats, vegetables and condiments. The term was first recorded in the 10th century, in a Latin manuscript from Gaeta in Central Italy. The modern pizza was invented in Naples, Italy, and the dish and its variants have since become popular and common in many areas of the world.</p>
## <p>In 2009, upon Italy's request, Neapolitan pizza was safeguarded in the European Union as a Traditional Speciality Guaranteed dish. The <i>Associazione Verace Pizza Napoletana</i> (the True Neapolitan Pizza Association) is a non-profit organization founded in 1984 with headquarters in Naples. It promotes and protects the "true Neapolitan pizza".</p>
## <p>Pizza is sold fresh or frozen, either whole or in portions, and is a common fast food item in Europe and North America. Various types of ovens are used to cook them and many varieties exist. Several similar dishes are prepared from ingredients commonly used in pizza preparation, such as calzone and stromboli.</p>
## <p></p>
Chapter 3 - Diving deeper into the Twitter API
Twitter API and Authentication - 1) Twitter API, 2) filtering tweets, 3) API Authentication and Oauth, 4) Python package “tweepy”:
Example code includes:
# DO NOT RUN THIS - NO IDEA WHOSE KEYS THESE ARE (DataCamp???)
# Import package
import tweepy
# Store OAuth authentication credentials in relevant variables
access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy"
access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx"
consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM"
consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i"
# Pass OAuth details to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
# The class MyStreamListener is available at https://gist.github.com/hugobowne/18f1c0c0709ed1a52dc5bcd462ac69f4
# Initialize Stream listener
l = MyStreamListener()
# Create your Stream object with authentication
stream = tweepy.Stream(auth, l)
# Filter Twitter Streams to capture data by the keywords:
stream.filter(track=['clinton', 'trump', 'sanders', 'cruz'])
# Import package
import json
# String of path to file: tweets_data_path
tweets_data_path = "tweets.txt"
# Initialize empty list to store tweets: tweets_data
tweets_data = []
# Open connection to file
tweets_file = open(tweets_data_path, "r")
# Read in tweets and store in list: tweets_data
for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)
# Close connection to file
tweets_file.close()
# Print the keys of the first tweet dict
print(tweets_data[0].keys())
# Import package
import pandas as pd
# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns=["text", "lang"])
# Print head of DataFrame
print(df.head())
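# Import the regular expression module used by word_in_text()
import re
def word_in_text(word, tweet):
    # Lower-case both strings so the search is case-insensitive
    word = word.lower()
    text = tweet.lower()
    # Search the lower-cased text (not the original tweet)
    match = re.search(word, text)
    if match:
        return True
    return False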
# Initialize list to store tweet counts
[clinton, trump, sanders, cruz] = [0, 0, 0, 0]
# Iterate through df, counting the number of tweets in which
# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])
# Import packages
import matplotlib.pyplot as plt
import seaborn as sns
# Set seaborn style
sns.set(color_codes=True)
# Create a list of labels:cd
cd = ['clinton', 'trump', 'sanders', 'cruz']
# Plot bar chart of the mention counts (keyword arguments work across seaborn versions)
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()
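The loading and counting steps above can be tried without Twitter credentials. The following is a minimal sketch, assuming three hand-written JSON lines stand in for tweets.txt; the sample tweets and the helper name count_mentions are hypothetical and not part of the course material.

```python
import json
import re

# Hypothetical stand-in for the lines of tweets.txt (one JSON object per line)
sample_lines = [
    '{"text": "Clinton and Trump debate tonight", "lang": "en"}',
    '{"text": "Sanders rally draws a crowd", "lang": "en"}',
    '{"text": "TRUMP leads in new poll", "lang": "en"}',
]

# Parse each line into a dict, as the file-reading loop does
tweets = [json.loads(line) for line in sample_lines]


def word_in_text(word, text):
    """Return True if word occurs in text, ignoring case."""
    return re.search(re.escape(word), text, flags=re.IGNORECASE) is not None


def count_mentions(words, tweets):
    """Count how many tweets mention each word (hypothetical helper)."""
    return {w: sum(word_in_text(w, t["text"]) for t in tweets) for w in words}


counts = count_mentions(["clinton", "trump", "sanders", "cruz"], tweets)
print(counts)
```

Using re.escape() is a small design choice: it keeps a keyword that contains regex metacharacters from being interpreted as a pattern.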
Chapter 1 - Exploring Your Data
Diagnose data for cleaning - column names, missing data, outliers, duplicate rows, un-tidy data, unexpected data values, etc.:
Exploratory data analysis - suppose that a pandas DataFrame, df, has already been created:
Visual exploratory data analysis - easy way to spot outliers and obvious errors - assume again that a pandas DataFrame, df, has already been created:
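The diagnostic checks described above can be sketched on a tiny made-up DataFrame. The column names and values below are hypothetical (they are not the NYC data), and the 1.5 x IQR rule is just one common convention for flagging outliers, not something the course prescribes.

```python
import pandas as pd

# Hypothetical data with a missing value and an obvious outlier
df = pd.DataFrame({
    "borough": ["BROOKLYN", "QUEENS", "BROOKLYN", None],
    "cost": [100.0, 250.0, 175.0, 1000000.0],
})

# Frequency counts, keeping missing values visible
borough_counts = df["borough"].value_counts(dropna=False)
print(borough_counts)

# Summary statistics; a max far above the 75th percentile hints at an outlier
summary = df["cost"].describe()
print(summary)

# Flag values more than 1.5 IQRs above the third quartile
q1, q3 = df["cost"].quantile([0.25, 0.75])
outliers = df[df["cost"] > q3 + 1.5 * (q3 - q1)]
print(outliers)
```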
Example code includes:
# Downloaded small portion to myPath from https://data.cityofnewyork.us/Housing-Development/DOB-Job-Application-Filings/ic3t-wcy2/data
# tempData = pd.read_csv(myPath + "DOB_JOB_Application_Filings.csv")
# keyCols = ["Borough", "State", "Site Fill", "Existing Zoning Sqft", "Initial Cost", "Total Est. Fee"]
# useData = tempData[keyCols]
# useData.loc[:, "initial_cost"] = [float(d[1:]) for d in useData["Initial Cost"]]
# useData.loc[:, "total_est_fee"] = [float(d[1:]) for d in useData["Total Est. Fee"]]
# useData.to_csv(myPath + "dob_job_application_filings_subset.csv")
# MAY NEED TO GET DATA FROM https://opendata.cityofnewyork.us/
# Import pandas
import pandas as pd
myPath = "./PythonInputFiles/"
# Read the file into a DataFrame: df
df = pd.read_csv(myPath + 'dob_job_application_filings_subset.csv')
# Print the head of df
print(df.head())
# Print the tail of df
print(df.tail())
# Print the shape of df
print(df.shape)
# Print the columns of df
print(df.columns)
# Print the info of df
print(df.info())
# Print the value counts for 'Borough'
print(df['Borough'].value_counts(dropna=False))
# Print the value_counts for 'State'
print(df['State'].value_counts(dropna=False))
# Print the value counts for 'Site Fill'
print(df['Site Fill'].value_counts(dropna=False))
# Import matplotlib.pyplot
import matplotlib.pyplot as plt
# Plot the histogram
df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)
# Display the histogram
# plt.show()
plt.savefig("_dummyPy045.png", bbox_inches="tight")
plt.clf()
# Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt
# Create the boxplot
df.boxplot(column="initial_cost", by="Borough", rot=90)
# Display the plot
# plt.show()
plt.savefig("_dummyPy046.png", bbox_inches="tight")
plt.clf()
# Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt
# Create and display the first scatter plot
df.plot(kind="scatter", x="initial_cost", y="total_est_fee", rot=70)
# plt.show()
plt.savefig("_dummyPy047.png", bbox_inches="tight")
plt.clf()
## Unnamed: 0 Borough State Site Fill Existing Zoning Sqft \
## 0 0 BROOKLYN NY USE UNDER 300 CU.YD 0
## 1 1 BROOKLYN NY NaN 0
## 2 2 MANHATTAN NY NOT APPLICABLE 0
## 3 3 QUEENS NY NOT APPLICABLE 0
## 4 4 BROOKLYN NY NOT APPLICABLE 0
##
## Initial Cost Total Est. Fee initial_cost total_est_fee
## 0 $0.00 $420.00 0.0 420.0
## 1 $0.00 $170.00 0.0 170.0
## 2 $60000.00 $831.50 60000.0 831.5
## 3 $31000.00 $692.80 31000.0 692.8
## 4 $3000.00 $225.00 3000.0 225.0
## Unnamed: 0 Borough State Site Fill Existing Zoning Sqft \
## 138 138 QUEENS NY NaN 0
## 139 139 QUEENS NY NOT APPLICABLE 0
## 140 140 BROOKLYN NY NOT APPLICABLE 0
## 141 141 BROOKLYN NY USE UNDER 300 CU.YD 0
## 142 142 BRONX NY NaN 0
##
## Initial Cost Total Est. Fee initial_cost total_est_fee
## 138 $63000.00 $832.40 63000.0 832.4
## 139 $21000.00 $212.40 21000.0 212.4
## 140 $2800.00 $395.00 2800.0 395.0
## 141 $0.00 $472.00 0.0 472.0
## 142 $0.00 $170.00 0.0 170.0
## (143, 9)
## Index(['Unnamed: 0', 'Borough', 'State', 'Site Fill', 'Existing Zoning Sqft',
## 'Initial Cost', 'Total Est. Fee', 'initial_cost', 'total_est_fee'],
## dtype='object')
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 143 entries, 0 to 142
## Data columns (total 9 columns):
## Unnamed: 0 143 non-null int64
## Borough 143 non-null object
## State 143 non-null object
## Site Fill 120 non-null object
## Existing Zoning Sqft 143 non-null int64
## Initial Cost 143 non-null object
## Total Est. Fee 143 non-null object
## initial_cost 143 non-null float64
## total_est_fee 143 non-null float64
## dtypes: float64(2), int64(2), object(5)
## memory usage: 7.3+ KB
## None
## MANHATTAN 66
## BROOKLYN 44
## QUEENS 16
## STATEN ISLAND 10
## BRONX 7
## Name: Borough, dtype: int64
## NY 136
## NJ 6
## NC 1
## Name: State, dtype: int64
## NOT APPLICABLE 108
## NaN 23
## USE UNDER 300 CU.YD 8
## ON-SITE 4
## Name: Site Fill, dtype: int64
NYC Open Data Sub-sample (Building Permits - Existing Zoning Sq Ft):
NYC Open Data Sub-sample (Building Permits - Initial Cost by Borough):
NYC Open Data Sub-sample (Building Permits):
Chapter 2 - Tidying data for analysis
Tidy data per the Hadley Wickham paper - “standard way to organize data within a dataset”:
Pivoting data is the opposite of melting; turn unique values in to separate columns (assuming again that the DataFrame, df, already exists):
Beyond melt and pivot - example from the Wickham data of having a single variable that combines sex and age-group (TB data) - common shape for reporting, but less than ideal for analysis:
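A minimal sketch of the melt and pivot round trip described above, assuming a small hypothetical table (the names wide, long, and back are illustrative only):

```python
import pandas as pd

# Hypothetical wide table: one row per day, one column per measurement
wide = pd.DataFrame({
    "day": [1, 2, 3],
    "ozone": [41.0, 36.0, 12.0],
    "wind": [7.4, 8.0, 12.6],
})

# Melt: measurement columns become (measurement, reading) rows
long = pd.melt(wide, id_vars=["day"], var_name="measurement", value_name="reading")
print(long)

# Pivot: unique values of 'measurement' become columns again
back = long.pivot_table(index="day", columns="measurement", values="reading").reset_index()
back.columns.name = None  # drop the leftover 'measurement' axis label
print(back)
```

pivot_table() aggregates (by mean, by default) when several rows share the same index and column values, which is why it tolerates duplicates that would make a plain pivot() raise an error.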
Example code includes:
# THIS SEEMS TO BE THE STANDARD R datasets airquality file, loaded as a pandas DataFrame
# Saved airquality.csv to the ./PythonInputFiles
myPath = "./PythonInputFiles/"
import pandas as pd
import numpy as np
airquality = pd.read_csv(myPath + "airquality.csv")
# Print the head of airquality
print(airquality.head())
# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=["Month", "Day"])
# Print the head of airquality_melt
print(airquality_melt.head())
# Print the head of airquality
print(airquality.head())
# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=["Month", "Day"], var_name="measurement", value_name="reading")
# Print the head of airquality_melt
print(airquality_melt.head())
# Print the head of airquality_melt
print(airquality_melt.head())
# airquality_melt.pivot() failed here; not sure why (possibly because the index uses 2+ variables), so pivot_table() is used instead
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=["Month", "Day"], columns="measurement", values="reading")
# Print the head of airquality_pivot
print(airquality_pivot.head())
# Print the index of airquality_pivot
print(airquality_pivot.index)
# Reset the index of airquality_pivot: airquality_pivot
airquality_pivot = airquality_pivot.reset_index()
# Print the new index of airquality_pivot
print(airquality_pivot.index)
# Print the head of airquality_pivot
print(airquality_pivot.head())
# Pivot airquality_dup: airquality_pivot
# keyRows = [x for x in range(len(airquality.index))] + [2, 4, 6, 8, 10]
# airquality_dup = airquality.iloc[keyRows, :]
airquality_pivot = airquality_melt.pivot_table(index=["Month", "Day"], columns="measurement", values="reading", aggfunc=np.mean)
# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()
# Print the head of airquality_pivot
print(airquality_pivot.head())
# Print the head of airquality
print(airquality.head())
# tb is 201x18 with variables ['country', 'year', 'm014', 'm1524', 'm2534', 'm3544', 'm4554', 'm5564', 'm65', 'mu', 'f014', 'f1524', 'f2534', 'f3544', 'f4554', 'f5564', 'f65', 'fu']
# year is set to be always 2000 with fu and mu always NaN
# Create dummy data for tb (just use 3 countries and the 014 and 1524 columns)
tb = pd.DataFrame( { "country":["USA", "CAN", "MEX"] , "year":2000 , "fu":np.nan , "mu":np.nan , "f014":[2, 3, 4] , "m014":[5, 6, 7] , "f1524": [8, 9, 0] , "m1524":[1, 2, 3] } )
# Melt tb: tb_melt
tb_melt = pd.melt(tb, id_vars=["country", "year"])
# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]
# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]
# Print the head of tb_melt
print(tb_melt.head()) # With the full 201 x 18 tb data this is 3,216 x 6; the dummy data gives 18 x 6 ['country', 'year', 'variable', 'value', 'gender', 'age_group']
# Ebola dataset is available at https://data.humdata.org/dataset/ebola-cases-2014
# Variables are split by an underscore 'Date', 'Day', 'Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone', 'Cases_Nigeria', 'Cases_Senegal', 'Cases_UnitedStates', 'Cases_Spain', 'Cases_Mali', 'Deaths_Guinea', 'Deaths_Liberia', 'Deaths_SierraLeone', 'Deaths_Nigeria', 'Deaths_Senegal', 'Deaths_UnitedStates', 'Deaths_Spain', 'Deaths_Mali'
# Downloaded file, then manipulated to be like the above as follows:
# ebola_test = pd.read_csv(myPath + "ebola_data_db_format.csv")
# ebola_test["UseCountry"] = ebola_test["Country"].str.replace(" ", "")
# ebola_test["UseCountry"] = ebola_test["UseCountry"].str.replace("2", "")
# keyIndic = ["Cumulative number of confirmed Ebola deaths", "Cumulative number of confirmed Ebola cases"]
# keyBool = [x in keyIndic for x in ebola_test["Indicator"]]
# ebola_test = ebola_test.loc[keyBool, :]
# indicMap = {keyIndic[0]:"Deaths", keyIndic[1]:"Cases"}
# ebola_test["UseIndicator"] = ebola_test["Indicator"].map(indicMap)
# ebolaPre = ebola_test[["Date", "UseCountry", "UseIndicator", "value"]]
# ebolaPre["CI"] = ebolaPre["UseIndicator"] + "_" + ebolaPre["UseCountry"]
# ebolaSave = ebolaPre.pivot_table(index="Date", columns="CI", values="value", aggfunc="max").fillna(method="ffill").fillna(0)
# ebolaSave.to_csv(myPath + "ebola.csv")
ebola = pd.read_csv(myPath + "ebola.csv", parse_dates=["Date"])
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=["Date"], var_name="type_country", value_name="counts")
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt["type_country"].str.split("_")
# Create the 'type' column
ebola_melt['type'] = ebola_melt['str_split'].str.get(0)
# Create the 'country' column
ebola_melt['country'] = ebola_melt['str_split'].str.get(1)
# Print the head of ebola_melt
print(ebola_melt.head())
# ebola_melt.to_csv(myPath + "ebola_melt.csv", index=False)
# Run outside of this shell so that the file is accessible later
## Ozone Solar.R Wind Temp Month Day
## 0 41.0 190.0 7.4 67 5 1
## 1 36.0 118.0 8.0 72 5 2
## 2 12.0 149.0 12.6 74 5 3
## 3 18.0 313.0 11.5 62 5 4
## 4 NaN NaN 14.3 56 5 5
## Month Day variable value
## 0 5 1 Ozone 41.0
## 1 5 2 Ozone 36.0
## 2 5 3 Ozone 12.0
## 3 5 4 Ozone 18.0
## 4 5 5 Ozone NaN
## Ozone Solar.R Wind Temp Month Day
## 0 41.0 190.0 7.4 67 5 1
## 1 36.0 118.0 8.0 72 5 2
## 2 12.0 149.0 12.6 74 5 3
## 3 18.0 313.0 11.5 62 5 4
## 4 NaN NaN 14.3 56 5 5
## Month Day measurement reading
## 0 5 1 Ozone 41.0
## 1 5 2 Ozone 36.0
## 2 5 3 Ozone 12.0
## 3 5 4 Ozone 18.0
## 4 5 5 Ozone NaN
## Month Day measurement reading
## 0 5 1 Ozone 41.0
## 1 5 2 Ozone 36.0
## 2 5 3 Ozone 12.0
## 3 5 4 Ozone 18.0
## 4 5 5 Ozone NaN
## measurement Ozone Solar.R Temp Wind
## Month Day
## 5 1 41.0 190.0 67.0 7.4
## 2 36.0 118.0 72.0 8.0
## 3 12.0 149.0 74.0 12.6
## 4 18.0 313.0 62.0 11.5
## 5 NaN NaN 56.0 14.3
## MultiIndex(levels=[[5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
## labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]],
## names=['Month', 'Day'])
## RangeIndex(start=0, stop=153, step=1)
## measurement Month Day Ozone Solar.R Temp Wind
## 0 5 1 41.0 190.0 67.0 7.4
## 1 5 2 36.0 118.0 72.0 8.0
## 2 5 3 12.0 149.0 74.0 12.6
## 3 5 4 18.0 313.0 62.0 11.5
## 4 5 5 NaN NaN 56.0 14.3
## measurement Month Day Ozone Solar.R Temp Wind
## 0 5 1 41.0 190.0 67.0 7.4
## 1 5 2 36.0 118.0 72.0 8.0
## 2 5 3 12.0 149.0 74.0 12.6
## 3 5 4 18.0 313.0 62.0 11.5
## 4 5 5 NaN NaN 56.0 14.3
## Ozone Solar.R Wind Temp Month Day
## 0 41.0 190.0 7.4 67 5 1
## 1 36.0 118.0 8.0 72 5 2
## 2 12.0 149.0 12.6 74 5 3
## 3 18.0 313.0 11.5 62 5 4
## 4 NaN NaN 14.3 56 5 5
## country year variable value gender age_group
## 0 USA 2000 f014 2.0 f 014
## 1 CAN 2000 f014 3.0 f 014
## 2 MEX 2000 f014 4.0 f 014
## 3 USA 2000 f1524 8.0 f 1524
## 4 CAN 2000 f1524 9.0 f 1524
## Date type_country counts str_split type country
## 0 2014-08-29 Cases_Guinea 482.0 [Cases, Guinea] Cases Guinea
## 1 2014-09-05 Cases_Guinea 604.0 [Cases, Guinea] Cases Guinea
## 2 2014-09-08 Cases_Guinea 664.0 [Cases, Guinea] Cases Guinea
## 3 2014-09-12 Cases_Guinea 678.0 [Cases, Guinea] Cases Guinea
## 4 2014-09-16 Cases_Guinea 743.0 [Cases, Guinea] Cases Guinea
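As an alternative to building a column of lists and calling .str.get(), the split can return separate columns directly. This is a minimal sketch on hypothetical rows (the counts are made up, not taken from the Ebola file):

```python
import pandas as pd

# Hypothetical melted column that packs two variables into one string
melted = pd.DataFrame({
    "type_country": ["Cases_Guinea", "Deaths_Guinea", "Cases_Liberia"],
    "counts": [482.0, 287.0, 100.0],
})

# expand=True returns one column per piece instead of a column of lists
parts = melted["type_country"].str.split("_", expand=True)
melted["type"] = parts[0]
melted["country"] = parts[1]
print(melted)
```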
Chapter 3 - Combining data for analysis
Concatenating data - data may be in separate files (too many records, time series data by day, etc.), while you want to combine it:
Finding and concatenating data - issue of many files needing to be concatenated:
Merge data - extension on concatenation (which is more piecing something back together that was originally one piece but became split):
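The difference between concatenating and merging can be sketched on two small hypothetical tables. The frames part1, part2, and sites below are made up for illustration; validate is an optional pd.merge argument that raises an error if the key relationship is not the one stated.

```python
import pandas as pd

# Hypothetical pieces of one table that was split row-wise
part1 = pd.DataFrame({"site": ["DR-1", "DR-3"], "visits": [2, 4]})
part2 = pd.DataFrame({"site": ["MSK-4"], "visits": [1]})

# Row-wise concatenation; ignore_index=True renumbers the rows 0..n-1
visits = pd.concat([part1, part2], ignore_index=True)

# Hypothetical lookup table keyed by a differently named column
sites = pd.DataFrame({"name": ["DR-1", "DR-3", "MSK-4"], "lat": [-50.0, -47.0, -48.9]})

# Merge on differently named keys; validate raises if keys are not one-to-one
merged = pd.merge(left=sites, right=visits, left_on="name", right_on="site",
                  validate="one_to_one")
print(merged)
```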
Example code includes:
myPath = "./PythonInputFiles/"
import pandas as pd
import numpy as np
# uber datasets are a small subset from within http://data.beta.nyc/dataset/uber-trip-data-foiled-apr-sep-2014
# downloaded file "Uber-Jan-Feb-FOIL.csv" to myPath
uber = pd.read_csv(myPath + "Uber-Jan-Feb-FOIL.csv")
cuts = [round(len(uber.index) / 3), round(2 * len(uber.index) / 3)]
uber1 = uber.iloc[:cuts[0], :]
uber2 = uber.iloc[cuts[0]:cuts[1], :]
uber3 = uber.iloc[cuts[1]:, :]
# Save outside of this routine
# uber1.to_csv(myPath + "uber1.csv", index=False)
# uber2.to_csv(myPath + "uber2.csv", index=False)
# uber3.to_csv(myPath + "uber3.csv", index=False)
# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])
# Print the shape of row_concat
print(row_concat.shape)
# Print the head of row_concat
print(row_concat.head())
print(np.sum(row_concat != uber))
# ebola_melt is 1,952x4 of Date-Day-status_country-counts
# status_country is 1,952x2 of status-country (the previous status_country has been string split)
# Create this from the file in the previous exercise
ebola_melt = pd.read_csv(myPath + "ebola_melt.csv", parse_dates=["Date"])
ebola_melt.columns = ["Date", "status_country", "counts", "str_split", "status", "country"]
status_country = ebola_melt[["status", "country"]]
ebola_melt = ebola_melt[["Date", "status_country", "counts"]]
# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis=1)
# Print the shape of ebola_tidy
print(ebola_tidy.shape)
# Print the head of ebola_tidy
print(ebola_tidy.head())
# Has files ['uber-raw-data-2014_06.csv', 'uber-raw-data-2014_04.csv', 'uber-raw-data-2014_05.csv'] available
# Date/Time-Lat-Lon-Base
# Import necessary modules
import glob
import pandas as pd
# Write the pattern: pattern
# This is designed to get the uber1.csv, uber2.csv, and uber3.csv files
pattern = myPath + 'uber?.csv'
# Save all file matches: csv_files
csv_files = glob.glob(pattern)
# Print the file names
print(csv_files)
# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])
# Print the head of csv2
print(csv2.head())
# Create an empty list: frames
frames = []
# Iterate over csv_files
for csv in csv_files:
    # Read csv into a DataFrame: df
    df = pd.read_csv(csv)
    # Append df to frames
    frames.append(df)
# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)
# Print the shape of uber
print(uber.shape)
# Print the head of uber
print(uber.head())
# site is a 3x3 with name-lat-long - name=["DR-1", "DR-3", "MSK-4"], lat=[-50, -47, -48.9], lon=[-129, -127, -123.4]
# visited is a 3x3 with ident-site-dated - ident=[619, 734, 837], site=["DR-1", "DR-3", "MSK-4"], dated=["1927-02", "1939-01", "1932-01"]
site = pd.DataFrame( { "name":["DR-1", "DR-3", "MSK-4"], "lat":[-50, -47, -48.9], "lon":[-129, -127, -123.4] } )
visited = pd.DataFrame( { "ident":[619, 734, 837], "site":["DR-1", "DR-3", "MSK-4"], "dated":["1927-02", "1939-01", "1932-01"] } )
# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on=["name"], right_on=["site"])
# Print o2o
print(o2o)
# now make visited 8x3 with ident=[619, 622, 734, 735, 751, 752, 837, 844], site=['DR-1', 'DR-1', 'DR-3', 'DR-3', 'DR-3', 'DR-3', 'MSK-4', 'DR-1'], dated=['1927-02-08', '1927-02-10', '1939-01-07', '1930-01-12', '1930-02-26', nan, '1932-01-14', '1932-03-22']
visited = pd.DataFrame( {"ident":[619, 622, 734, 735, 751, 752, 837, 844], "site":['DR-1', 'DR-1', 'DR-3', 'DR-3', 'DR-3', 'DR-3', 'MSK-4', 'DR-1'], "dated":['1927-02-08', '1927-02-10', '1939-01-07', '1930-01-12', '1930-02-26', np.nan, '1932-01-14', '1932-03-22']} )
# Merge the DataFrames: m2o
m2o = pd.merge(left=site, right=visited, left_on=["name"], right_on=["site"])
# Print m2o
print(m2o)
# add an additional frame surveyed which is 21x4 with taken-person-quant-reading (taken matched ident in file visited)
# Merge site and visited: m2m
# m2m = pd.merge(left=site, right=visited, left_on=["name"], right_on=["site"])
# Merge m2m and survey: m2m
# m2m = pd.merge(left=m2m, right=survey, left_on=["ident"], right_on=["taken"])
# Print the first 20 lines of m2m
# print(m2m.head(20))
## (354, 4)
## dispatching_base_number date active_vehicles trips
## 0 B02512 1/1/2015 190 1132
## 1 B02765 1/1/2015 225 1765
## 2 B02764 1/1/2015 3427 29421
## 3 B02682 1/1/2015 945 7679
## 4 B02617 1/1/2015 1228 9537
## dispatching_base_number 0
## date 0
## active_vehicles 0
## trips 0
## dtype: int64
## (5180, 5)
## Date status_country counts status country
## 0 2014-08-29 Cases_Guinea 482.0 Cases Guinea
## 1 2014-09-05 Cases_Guinea 604.0 Cases Guinea
## 2 2014-09-08 Cases_Guinea 664.0 Cases Guinea
## 3 2014-09-12 Cases_Guinea 678.0 Cases Guinea
## 4 2014-09-16 Cases_Guinea 743.0 Cases Guinea
## ['./PythonInputFiles\\uber1.csv', './PythonInputFiles\\uber2.csv', './PythonInputFiles\\uber3.csv']
## dispatching_base_number date active_vehicles trips
## 0 B02765 1/20/2015 272 1608
## 1 B02617 1/20/2015 1350 10015
## 2 B02764 1/21/2015 3718 27344
## 3 B02512 1/21/2015 242 1519
## 4 B02682 1/21/2015 1228 9472
## (354, 4)
## dispatching_base_number date active_vehicles trips
## 0 B02512 1/1/2015 190 1132
## 1 B02765 1/1/2015 225 1765
## 2 B02764 1/1/2015 3427 29421
## 3 B02682 1/1/2015 945 7679
## 4 B02617 1/1/2015 1228 9537
## lat lon name dated ident site
## 0 -50.0 -129.0 DR-1 1927-02 619 DR-1
## 1 -47.0 -127.0 DR-3 1939-01 734 DR-3
## 2 -48.9 -123.4 MSK-4 1932-01 837 MSK-4
## lat lon name dated ident site
## 0 -50.0 -129.0 DR-1 1927-02-08 619 DR-1
## 1 -50.0 -129.0 DR-1 1927-02-10 622 DR-1
## 2 -50.0 -129.0 DR-1 1932-03-22 844 DR-1
## 3 -47.0 -127.0 DR-3 1939-01-07 734 DR-3
## 4 -47.0 -127.0 DR-3 1930-01-12 735 DR-3
## 5 -47.0 -127.0 DR-3 1930-02-26 751 DR-3
## 6 -47.0 -127.0 DR-3 NaN 752 DR-3
## 7 -48.9 -123.4 MSK-4 1932-01-14 837 MSK-4
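The glob-then-concatenate workflow used in this chapter can be reproduced end to end without the Uber files. The sketch below is illustrative only: it writes three tiny hypothetical CSV files to a temporary directory and then reads them back.

```python
import glob
import os
import tempfile

import pandas as pd

# Write three small hypothetical CSV files to a temporary directory
tmp_dir = tempfile.mkdtemp()
for i in range(1, 4):
    piece = pd.DataFrame({"base": ["B0" + str(i)], "trips": [100 * i]})
    piece.to_csv(os.path.join(tmp_dir, "uber" + str(i) + ".csv"), index=False)

# '?' matches exactly one character, so this finds uber1.csv .. uber3.csv
pattern = os.path.join(tmp_dir, "uber?.csv")
csv_files = sorted(glob.glob(pattern))  # glob order is not guaranteed

# Read every match and stack the rows
frames = [pd.read_csv(path) for path in csv_files]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Sorting the matches is a defensive choice: glob.glob() returns files in arbitrary order, so the row order of the combined frame would otherwise depend on the file system.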
Chapter 4 - Cleaning data for analysis
Data types and conversions - can see the data types using the df.dtypes attribute of a pandas DataFrame df:
Using regular expressions to clean strings - the most common form of data cleaning is string manipulation:
Using functions to clean data - in particular, the .apply() function:
Duplicate and missing data - can skew results in undesirable manners:
Testing with asserts - early detection for problems that may plague the analysis later:
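The cleaning steps listed above (type conversion, duplicate removal, filling missing values, and assert checks) can be chained on a small example. This is a minimal sketch on hypothetical data; filling with the column mean is just one possible choice.

```python
import pandas as pd

# Hypothetical messy data: numbers stored as text, a bad entry, a duplicate row
raw = pd.DataFrame({
    "name": ["a", "b", "b", "c"],
    "amount": ["10.5", "oops", "oops", "3"],
})

# Remove exact duplicate rows
clean = raw.drop_duplicates().reset_index(drop=True)

# Convert text to numbers; unparseable entries become NaN
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")

# Fill the missing value with the column mean
clean["amount"] = clean["amount"].fillna(clean["amount"].mean())

# Asserts fail loudly if the cleaning assumptions do not hold
assert clean["amount"].notnull().all()
assert (clean["amount"] >= 0).all()
print(clean)
```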
Example code includes:
# The tips data is available at https://github.com/mwaskom/seaborn-data/blob/master/tips.csv
myPath = "./PythonInputFiles/"
import pandas as pd
import numpy as np
tips = pd.read_csv(myPath + "tips.csv")
# Convert the sex column to type 'category'
tips.sex = tips["sex"].astype("category")
# Convert the smoker column to type 'category'
tips.smoker = tips["smoker"].astype("category")
# Print the info of tips
print(tips.info())
# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips["total_bill"], errors="coerce")
# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips["tip"], errors="coerce")
# Print the info of tips
print(tips.info())
# Import the regular expression module
import re
# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')
# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))
# See if the pattern matches
result = prog.match("1123-456-7890")
print(bool(result))
# Import the regular expression module
import re
# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')
# Print the matches
print(matches)
# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)
# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)
# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)
import numpy as np
# Define recode_sex()
def recode_sex(sex_value):
    # Return 1 if sex_value is 'Male'
    if sex_value == "Male":
        return 1
    # Return 0 if sex_value is 'Female'
    elif sex_value == "Female":
        return 0
    # Return np.nan
    else:
        return np.nan
# Apply the function to the sex column
tips['sex_recode'] = tips["sex"].apply(recode_sex)
# Create the total_dollar field
tips["total_dollar"] = "$" + tips["total_bill"].astype(str)
# Write the lambda function using replace
tips['total_dollar_replace'] = tips["total_dollar"].apply(lambda x: x.replace('$', ''))
# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips["total_dollar"].apply(lambda x: re.findall(r'\d+\.\d+', x))
# Print the head of tips
print(tips.head())
# DO NOT HAVE DATASET "tracks"
# Create the new DataFrame: tracks
# tracks = billboard[['year', 'artist', 'track', 'time']]
# Print info of tracks
# print(tracks.info())
# Drop the duplicates: tracks_no_duplicates
# tracks_no_duplicates = tracks.drop_duplicates()
# Print info of tracks
# print(tracks_no_duplicates.info())
# SEEMS TO BE "airquality" as per the R datasets package
# Previously saved as myPath + "airquality.csv"
airquality = pd.read_csv(myPath + "airquality.csv")
# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality["Ozone"].mean()
# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality["Ozone"].fillna(oz_mean)
# Print the info of airquality
print(airquality.info())
# DO NOT HAVE FRAME ebola - 122 x 18 of Date-Day-Cases_[8 countries]-Deaths_[8 countries]
# Use the version saved previously
ebola = pd.read_csv(myPath + "ebola.csv", parse_dates=["Date"])
# Assert that there are no missing values
assert ebola.notnull().all().all()
# Assert that all values are >= 0
assert (ebola >= 0).all().all()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 244 entries, 0 to 243
## Data columns (total 7 columns):
## total_bill 244 non-null float64
## tip 244 non-null float64
## sex 244 non-null category
## smoker 244 non-null category
## day 244 non-null object
## time 244 non-null object
## size 244 non-null int64
## dtypes: category(2), float64(2), int64(1), object(2)
## memory usage: 8.2+ KB
## None
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 244 entries, 0 to 243
## Data columns (total 7 columns):
## total_bill 244 non-null float64
## tip 244 non-null float64
## sex 244 non-null category
## smoker 244 non-null category
## day 244 non-null object
## time 244 non-null object
## size 244 non-null int64
## dtypes: category(2), float64(2), int64(1), object(2)
## memory usage: 8.2+ KB
## None
## True
## False
## ['10', '1']
## True
## True
## True
## total_bill tip sex smoker day time size sex_recode total_dollar \
## 0 16.99 1.01 Female No Sun Dinner 2 0 $16.99
## 1 10.34 1.66 Male No Sun Dinner 3 1 $10.34
## 2 21.01 3.50 Male No Sun Dinner 3 1 $21.01
## 3 23.68 3.31 Male No Sun Dinner 2 1 $23.68
## 4 24.59 3.61 Female No Sun Dinner 4 0 $24.59
##
## total_dollar_replace total_dollar_re
## 0 16.99 [16.99]
## 1 10.34 [10.34]
## 2 21.01 [21.01]
## 3 23.68 [23.68]
## 4 24.59 [24.59]
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 153 entries, 0 to 152
## Data columns (total 6 columns):
## Ozone 153 non-null float64
## Solar.R 146 non-null float64
## Wind 153 non-null float64
## Temp 153 non-null int64
## Month 153 non-null int64
## Day 153 non-null int64
## dtypes: float64(3), int64(3)
## memory usage: 7.2 KB
## None
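One detail of the regular expression checks in this chapter is worth illustrating: re.match() only anchors at the start of the string, so a phone-number pattern also accepts strings with extra trailing digits. A minimal sketch of the difference (the test strings are made up):

```python
import re

# Raw string so the backslashes reach the regex engine unchanged
phone = re.compile(r"\d{3}-\d{3}-\d{4}")

# match() only anchors at the start, so trailing characters are allowed
starts_ok = bool(phone.match("123-456-7890123"))

# fullmatch() requires the whole string to fit the pattern
whole_ok = bool(phone.fullmatch("123-456-7890123"))
exact_ok = bool(phone.fullmatch("123-456-7890"))

print(starts_ok, whole_ok, exact_ok)
```

When validating a whole field, fullmatch() (or a pattern ending in $) is usually the safer choice.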
Chapter 5 - Case Study
Putting it all together - Gapminder data (NPO supporting global sustainable development):
Initial impressions of the data - depending on the analysis needs, can melt (columns to rows) or pivot (new columns from column data) the data:
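A minimal sketch of melting a wide, one-column-per-year table into a long one, assuming a tiny hypothetical frame (the countries and values below are illustrative, not real Gapminder figures):

```python
import numpy as np
import pandas as pd

# Hypothetical wide life-expectancy table: one column per year
wide = pd.DataFrame({
    "Life expectancy": ["Norway", "Japan"],
    "1800": [37.9, 36.4],
    "1801": [38.0, np.nan],
})

# Melt year columns into rows and give the columns analysis-friendly names
long = pd.melt(wide, id_vars="Life expectancy")
long.columns = ["country", "year", "life_expectancy"]

# Year labels arrive as strings; convert them to integers
long["year"] = pd.to_numeric(long["year"])

# Drop rows without a reading
long = long.dropna()
print(long)
```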
Example code includes:
myPath = "./PythonInputFiles/"
# The DataFrame g1800s is a life expectancy table of 260 x 101 - "Life Expectancy" (country) followed by "1800" through "1899"
# Copied data from https://docs.google.com/spreadsheets/d/1H3nzTwbn8z4lJ5gJ_WfDgCeGEXK3PVGcNjQ_U5og8eo/pub as accessed from http://www.gapminder.org/data/ to myPath + "gapminder_lifeExp_1800_1916.xlsx"
import pandas as pd
gapExcel = pd.read_excel(myPath + "gapminder_lifeExp_1800_1916.xlsx")
# Convert column labels to text
gapExcel.columns = gapExcel.columns.astype(str)
assert gapExcel.columns[0] == "Life expectancy"
# Create booleans for 1800s, 1900s, and 2000s, including "Life expectancy" (country columns) as true in all
col1800s = gapExcel.columns.str.startswith("18")
col1900s = gapExcel.columns.str.startswith("19")
col2000s = gapExcel.columns.str.startswith("20")
col1800s[0] = True
col1900s[0] = True
col2000s[0] = True
# Create g1800s, g1900s, g2000s
g1800s = gapExcel.loc[:, col1800s]
g1900s = gapExcel.loc[:, col1900s]
g2000s = gapExcel.loc[:, col2000s]
# Import matplotlib.pyplot
import matplotlib.pyplot as plt
# Create the scatter plot
g1800s.plot(kind="scatter", x="1800", y="1899")
# Specify axis labels
plt.xlabel('Life Expectancy by Country in 1800')
plt.ylabel('Life Expectancy by Country in 1899')
# Specify axis limits
plt.xlim(20, 55)
plt.ylim(20, 55)
# Display the plot
# plt.show()
plt.savefig("_dummyPy048.png", bbox_inches="tight")
plt.clf()
import pandas as pd
import numpy as np
def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    no_na = row_data.dropna()[1:-1]
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0
# Check whether the first column is 'Life expectancy'
assert g1800s.columns[0] == "Life expectancy"
# Check whether the values in the row are valid
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all()
# Check that there is only one instance of each country
assert g1800s['Life expectancy'].value_counts().iloc[0] == 1
# Also frames g1900s as 260x101 and g2000s as 260x18
# Concatenate the DataFrames row-wise
gapminder = pd.concat([g1800s, g1900s, g2000s])
# Print the shape of gapminder
print(gapminder.shape)
# Print the head of gapminder
print(gapminder.head())
# Melt gapminder: gapminder_melt
gapminder_melt = pd.melt(gapminder, id_vars="Life expectancy")
# Rename the columns
gapminder_melt.columns = ['country', 'year', 'life_expectancy']
# Print the head of gapminder_melt
print(gapminder_melt.head())
# Exercises used gapminder_melt as gapminder - keep copy before over-writing in case needed later
gapminder_old = gapminder.copy()
gapminder = gapminder_melt.copy()
# Convert the year column to numeric
gapminder.year = pd.to_numeric(gapminder.year)
# Test if country is of type object
assert gapminder.country.dtypes == object
# Test if year is of type int64
assert gapminder.year.dtypes == np.int64
# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64
# Create the series of countries: countries
countries = gapminder["country"]
# Drop all the duplicates from countries
countries = countries.drop_duplicates()
# Write the regular expression: pattern
pattern = r'^[A-Za-z\.\s]*$'
# Create the Boolean vector: mask
mask = countries.str.contains(pattern)
# Invert the mask: mask_inverse
mask_inverse = ~mask # The ~ is for inversion
# Subset countries using mask_inverse: invalid_countries
invalid_countries = countries.loc[mask_inverse]
# Print invalid_countries
print(invalid_countries)
# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()
# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()
# Print the shape of gapminder (prior to dropping NaN)
print(gapminder.shape)
# Drop the missing values
gapminder = gapminder.dropna()
# Print the shape of gapminder (after dropping NaN)
print(gapminder.shape)
# Add first subplot
plt.subplot(2, 1, 1)
# Create a histogram of life_expectancy
gapminder["life_expectancy"].plot(kind="hist")
# Group gapminder: gapminder_agg
gapminder_agg = gapminder.groupby(by="year")["life_expectancy"].mean()
# Print the head of gapminder_agg
print(gapminder_agg.head())
# Print the tail of gapminder_agg
print(gapminder_agg.tail())
# Add second subplot
plt.subplot(2, 1, 2)
# Create a line plot of life expectancy per year
gapminder_agg.plot()
# Add title and specify axis labels
plt.title('Life expectancy over the years')
plt.ylabel('Life expectancy')
plt.xlabel('Year')
# Display the plots
plt.tight_layout()
# plt.show()
plt.savefig("_dummyPy049.png", bbox_inches="tight")
plt.clf()
# Save both DataFrames to csv files
gapminder.to_csv(myPath + "gapminder.csv")
gapminder_agg.to_csv(myPath + "gapminder_agg.csv")
## (780, 218)
## 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 \
## 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## 1 28.21 28.20 28.19 28.18 28.17 28.16 28.15 28.14 28.13 28.12
## 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## 3 35.40 35.40 35.40 35.40 35.40 35.40 35.40 35.40 35.40 35.40
## 4 28.82 28.82 28.82 28.82 28.82 28.82 28.82 28.82 28.82 28.82
##
## ... 2008 2009 2010 2011 2012 2013 2014 2015 \
## 0 ... NaN NaN NaN NaN NaN NaN NaN NaN
## 1 ... NaN NaN NaN NaN NaN NaN NaN NaN
## 2 ... NaN NaN NaN NaN NaN NaN NaN NaN
## 3 ... NaN NaN NaN NaN NaN NaN NaN NaN
## 4 ... NaN NaN NaN NaN NaN NaN NaN NaN
##
## 2016 Life expectancy
## 0 NaN Abkhazia
## 1 NaN Afghanistan
## 2 NaN Akrotiri and Dhekelia
## 3 NaN Albania
## 4 NaN Algeria
##
## [5 rows x 218 columns]
## country year life_expectancy
## 0 Abkhazia 1800 NaN
## 1 Afghanistan 1800 28.21
## 2 Akrotiri and Dhekelia 1800 NaN
## 3 Albania 1800 35.40
## 4 Algeria 1800 28.82
## 49 Congo, Dem. Rep.
## 50 Congo, Rep.
## 53 Cote d'Ivoire
## 73 Falkland Is (Malvinas)
## 93 Guinea-Bissau
## 98 Hong Kong, China
## 118 United Korea (former)
## 131 Macao, China
## 132 Macedonia, FYR
## 145 Micronesia, Fed. Sts.
## 161 Ngorno-Karabakh
## 187 St. Barthélemy
## 193 St.-Pierre-et-Miquelon
## 225 Timor-Leste
## 251 Virgin Islands (U.S.)
## 252 North Yemen (former)
## 253 South Yemen (former)
## 258 Åland
## Name: country, dtype: object
## (169260, 3)
## (43857, 3)
## year
## 1800 31.486020
## 1801 31.448905
## 1802 31.463483
## 1803 31.377413
## 1804 31.446318
## Name: life_expectancy, dtype: float64
## year
## 2012 71.663077
## 2013 71.916106
## 2014 72.088125
## 2015 72.321010
## 2016 72.556635
## Name: life_expectancy, dtype: float64
Gapminder Life Expectancy by Country (1899 vs 1800):
Gapminder Life Expectancy:
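The country-name check in the code above can be tried on a short hypothetical Series (the five names below are chosen for illustration). A sketch of the same pattern, mask, and inversion:

```python
import pandas as pd

# Hypothetical sample of country names
countries = pd.Series(["Albania", "St. Lucia", "Congo, Rep.", "Guinea-Bissau", "Åland"])

# Unaccented letters, periods, and whitespace only, from start to end of the string
pattern = r"^[A-Za-z\.\s]*$"

mask = countries.str.contains(pattern)  # True where the name fits the pattern
invalid = countries.loc[~mask]          # ~ inverts the Boolean mask

print(invalid.tolist())
```

Names are flagged for commas, hyphens, or accented letters, so a flagged name is not necessarily wrong; it only contains characters outside the pattern.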
Chapter 1 - Data Ingestion and Inspection
Review of pandas data frames - tabular data structure with labelled rows and columns:
Building DataFrames from scratch:
Importing and exporting data - example using ISSN_D_tot.csv, sunspot data:
Plotting with pandas - can plot either the pandas Series or the underlying NumPy array - plt.plot() followed by plt.show() works on either/both:
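The importing and exporting topic above can be sketched without any course files by reading from an in-memory string. The file contents below are a hypothetical stand-in for a messy tab-separated file with three lines of free text before the header and a comment line inside the data:

```python
import io
import pandas as pd

# Hypothetical messy file: 3 junk lines, then the header, then data with a comment line
raw = (
    "The following stock data was collected by hand\n"
    "All values are in US dollars\n"
    "Source: hypothetical\n"
    "name\tJan\tFeb\n"
    "IBM\t156.08\t160.01\n"
    "# a stray comment inside the data\n"
    "MSFT\t45.51\t43.08\n"
)

# delimiter: tab-separated; header=3: the fourth line holds the column names;
# comment="#": lines starting with # are ignored
df2 = pd.read_csv(io.StringIO(raw), delimiter="\t", header=3, comment="#")
print(df2)

# Export without the index; with no path given, to_csv returns the text
csv_text = df2.to_csv(index=False)
print(csv_text)
```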
Example code includes:
myPath = "./PythonInputFiles/"
# NEED TO CREATE FRAME df - "Total Population" - [3034970564.0, 3684822701.0, 4436590356.0, 5282715991.0, 6115974486.0, 6924282937.0] indexed by "Year" [1960, 1970, 1980, 1990, 2000, 2010]
# Import numpy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame( {"Total Population":[3034970564.0, 3684822701.0, 4436590356.0, 5282715991.0, 6115974486.0, 6924282937.0], "Year":[1960, 1970, 1980, 1990, 2000, 2010]} )
df.index = df["Year"]
del df["Year"]
world_population = df.copy()
# Create array of DataFrame values: np_vals
np_vals = df.values
# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)
# Create a new DataFrame by passing df to np.log10(): df_log10
df_log10 = np.log10(df)
# Print original and new data containers
print(type(np_vals), type(np_vals_log10))
print(type(df), type(df_log10))
list_keys = ['Country', 'Total']
list_values = [['United States', 'Soviet Union', 'United Kingdom'], [1118, 473, 273]]
# Zip the 2 lists together into one list of (key,value) tuples: zipped
zipped = list(zip(list_keys, list_values))
# Inspect the list using print()
print(zipped)
# Build a dictionary with the zipped list: data
data = dict(zipped)
# Build and inspect a DataFrame from the dictionary: df
df = pd.DataFrame(data)
print(df)
tempDict = {"a":[1980, 1981, 1982] , "b":["Blondie", "Chris Cross", "Joan Jett"] , "c":["Call Me", "Arthurs Theme", "I Love Rock and Roll"], "d":[6, 3, 7]}
df = pd.DataFrame(tempDict)
# Build a list of labels: list_labels
list_labels = ['year', 'artist', 'song', 'chart weeks']
# Assign the list of labels to the columns attribute: df.columns
df.columns = list_labels
print(df)
cities = ['Manheim', 'Preston park', 'Biglerville', 'Indiana', 'Curwensville', 'Crown', 'Harveys lake', 'Mineral springs', 'Cassville', 'Hannastown', 'Saltsburg', 'Tunkhannock', 'Pittsburgh', 'Lemasters', 'Great bend']
# Make a string with the value 'PA': state
state = "PA"
# Construct a dictionary: data
data = {'state':state, 'city':cities}
# Construct a DataFrame from dictionary data: df
df = pd.DataFrame(data)
# Print the DataFrame
print(df)
# "world_population.csv" is the same 6x2 population data as above
# Read in the file: df1
# df1 = pd.read_csv("world_population.csv")
# Skipped this part
# Create a list of the new column labels: new_labels
# new_labels = ["year", "population"]
# Read in the file, specifying the header and names parameters: df2
# df2 = pd.read_csv('world_population.csv', header=0, names=new_labels)
# Skipped this step
# Print both the DataFrames
# print(df1)
# print(df2)
# DO NOT HAVE the messy data - file_messy is "messy_stock_data.tsv"
# Read the raw file as-is: df1
# df1 = pd.read_csv(file_messy)
# Print the output of df1.head()
# print(df1.head())
# Read in the file with the correct parameters: df2
# df2 = pd.read_csv(file_messy, delimiter="\t", header=3, comment="#")
# Print the output of df2.head()
# print(df2.head())
# Save the cleaned up DataFrame to a CSV file without the index
# df2.to_csv(file_clean, index=False)
# Save the cleaned up DataFrame to an excel file without the index
# df2.to_excel('file_clean.xlsx', index=False)
# DO NOT HAVE DataFrame df, which is a 744x1 of "Temperature (deg F)" indexed automatically as 0-743
# Downloaded raw METAR data for KAUS using 0801100000 UTC - 0831102359 UTC
# Coded to a cleaned CSV as per below
#
#
# metarList = []
# for line in open(myPath + "KAUS_Metar_Aug2010.txt", "r"): metarList.append(line.rstrip())
# cleanMetar = []
# cleanLine = ""
# for recs in metarList:
#     if recs.startswith("#") or recs == "" : continue
#     if recs.startswith("2") :
#         if cleanLine != "" :
#             cleanMetar.append(cleanLine)
#         cleanLine = recs
#     else:
#         cleanLine = cleanLine + " " + recs.strip()
#
# cleanMetar.append(cleanLine)
#
# useMetar = [textBlock for textBlock in cleanMetar if "METAR" in textBlock]
# useSpeci = [textBlock for textBlock in cleanMetar if "SPECI" in textBlock]
# assert len(cleanMetar) == len(useMetar) + len(useSpeci)
#
# import re
#
# metTime = []
# tempF = []
# dewF = []
# altMG = []
#
# for textBlock in useMetar:
#     if textBlock.endswith("NIL="):
#         print("Not using line", textBlock)
#         continue
#
#     # print(textBlock)
#     dateUTC = textBlock.split()[0]
#
#     tempData = re.findall("T([0-9][0-9][0-9][0-9])([0-9][0-9][0-9][0-9])", textBlock)
#     assert len(tempData) == 1
#     a, b = tempData[0]
#     tempC = float(a[1:])/10
#     dewC = float(b[1:])/10
#     if a[0] == "1" : tempC = -tempC
#     if b[0] == "1" : dewC = -dewC
#
#     tF = round((9/5) * tempC + 32, 0)
#     dF = round((9/5) * dewC + 32, 0)
#
#     altData = re.findall("A([0-9][0-9][0-9][0-9])", textBlock)
#     assert len(altData) == 1
#
#     aMG = float(altData[0]) / 100
#     # print(dateUTC, tempC, dewC, altMG, tempF, dewF)
#
#     metTime.append(dateUTC)
#     tempF.append(tF)
#     dewF.append(dF)
#     altMG.append(aMG)
#
# metarKAUS = pd.DataFrame( {"DateTime (UTC)":metTime, "Temperature (deg F)":tempF , "Dew Point (deg F)":dewF, "Pressure (atm)":altMG} )
# metarKAUS.index = metarKAUS["DateTime (UTC)"]
# del metarKAUS["DateTime (UTC)"]
#
# metarKAUS.to_csv(myPath + "KAUS_Metar_Aug2010_Clean.csv")
# Create or import the data
# import random
# df = pd.DataFrame( {"Temperature (deg F)":np.random.randint(low=60, high=100, size=744)} )
dfFull = pd.read_csv(myPath + "KAUS_Metar_Aug2010_Clean.csv")
df = dfFull.loc[:, "Temperature (deg F)"]
# Create a plot with color='red'
df.plot(color="red")
# Add a title
plt.title('Temperature in Austin')
# Specify the x-axis label
plt.xlabel('Hours since midnight August 1, 2010')
# Specify the y-axis label
plt.ylabel('Temperature (degrees F)')
# Display the plot
# plt.show()
plt.savefig("_dummyPy050.png", bbox_inches="tight")
plt.clf()
# DO NOT HAVE DataFrame df, which is a 744x3 of "Temperature (deg F)", "Dew Point (deg F)", "Pressure (atm)" indexed automatically as 0-743
# df["Dew Point (deg F)"] = df.iloc[:, 0] + np.random.randint(low=-30, high=0, size=744)
# df["Pressure (atm)"] = np.random.randint(low=980, high=1020, size=744)
# Use dfFull rather than manufacturing data
df = dfFull.copy()
df.index = [x[6:8] + "-" + "{0:0>2}".format(str(int(x[9:10]) + 1)) + "Z" for x in df["DateTime (UTC)"].astype(str)]
del df["DateTime (UTC)"]
# Plot all columns (default)
df.plot()
# plt.show()
plt.savefig("_dummyPy051.png", bbox_inches="tight")
plt.clf()
# Plot all columns as subplots
df.plot(subplots=True)
# plt.show()
plt.savefig("_dummyPy052.png", bbox_inches="tight")
plt.clf()
# Plot just the Dew Point data
column_list1 = ['Dew Point (deg F)']
df[column_list1].plot()
# plt.show()
plt.savefig("_dummyPy053.png", bbox_inches="tight")
plt.clf()
# Plot the Dew Point and Temperature data, but not the Pressure data
column_list2 = ['Temperature (deg F)','Dew Point (deg F)']
df[column_list2].plot()
# plt.show()
plt.savefig("_dummyPy054.png", bbox_inches="tight")
plt.clf()
## <class 'numpy.ndarray'> <class 'numpy.ndarray'>
## <class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'>
## [('Country', ['United States', 'Soviet Union', 'United Kingdom']), ('Total', [1118, 473, 273])]
## Country Total
## 0 United States 1118
## 1 Soviet Union 473
## 2 United Kingdom 273
## year artist song chart weeks
## 0 1980 Blondie Call Me 6
## 1 1981 Chris Cross Arthurs Theme 3
## 2 1982 Joan Jett I Love Rock and Roll 7
## city state
## 0 Manheim PA
## 1 Preston park PA
## 2 Biglerville PA
## 3 Indiana PA
## 4 Curwensville PA
## 5 Crown PA
## 6 Harveys lake PA
## 7 Mineral springs PA
## 8 Cassville PA
## 9 Hannastown PA
## 10 Saltsburg PA
## 11 Tunkhannock PA
## 12 Pittsburgh PA
## 13 Lemasters PA
## 14 Great bend PA
Temperature - Austin, TX (Aug 2010):
METAR plots - Austin, TX (Aug 2010):
METAR Sub-plots - Austin, TX (Aug 2010):
Dew Point - Austin, TX (Aug 2010):
Temperature and Dew Point - Austin, TX (Aug 2010):
Chapter 2 - Exploratory Data Analysis
Visual exploratory data analysis - using Fisher’s iris flower data (similar to the R dataset):
Statistical exploratory data analysis - starting with the .describe() method which is very similar to summary() in R - counts, means, quartiles, and the like:
Separating populations with boolean indexing - subsets of columns and/or rows for plotting, summarizing, and the like:
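A small sketch of the describe-then-separate workflow listed above, on a hypothetical four-row table (the column names and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical car data
cars = pd.DataFrame({
    "origin": ["USA", "USA", "Europe", "Asia"],
    "mpg": [18.0, 22.0, 30.0, 34.0],
    "hp": [150, 130, 90, 70],
})

# Summary statistics for one column (count, mean, std, quartiles)
print(cars["mpg"].describe())

# Boolean mask selects one population
is_us = cars["origin"] == "USA"
us = cars.loc[is_us, :]

# Compare the subgroup against the whole; numeric_only skips the text column
global_mean = cars.mean(numeric_only=True)
us_mean = us.mean(numeric_only=True)
diff = us_mean - global_mean
print(diff)
```

Here the US rows average 6 mpg below and 30 hp above the overall mean, which is the kind of difference the boolean subset makes visible.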
Example code includes:
myPath = "./PythonInputFiles/"
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
dummyStock = pd.read_csv(myPath + "StockChart_20170615.csv", header=None)
dummyStock.columns = ["Symbol", "Data"]
# Data is a single space-delimited string of Date - Open - High - Low - Close - Volume
dummyStockSplit = dummyStock["Data"].str.split()
dummyDates = [datetime.strptime(x[0], "%m/%d/%Y") for x in dummyStockSplit]
dummyClose = [float(x[4]) for x in dummyStockSplit]
dfStock = pd.DataFrame( {"date":dummyDates, "symbol":dummyStock["Symbol"] , "close":dummyClose} )
df = dfStock.pivot(index="date", columns="symbol", values="close").resample("M").max()
# df is 12 x 4 with columns Month-AAPL-GOOG-IBM
# Create a list of y-axis column names: y_columns
y_columns = ["AAPL", "IBM"]
# Generate a line plot
df.plot(y=y_columns)
# Add the title
plt.title('Monthly stock prices')
# Add the y-axis label
plt.ylabel('Price ($US)')
# Display the plot
# plt.show()
plt.savefig("_dummyPy055.png", bbox_inches="tight")
plt.clf()
# Here, df appears to be the mtcars data
# Saved file from R
df = pd.read_csv(myPath + "mtcars.csv", index_col=0)
# sizes is a pre-defined np.array(), not sure of what
sizes = df["cyl"]
# Generate a scatter plot
df.plot(kind="scatter", x='hp', y='mpg', s=5*(sizes-3))
# Add the title
plt.title('Fuel efficiency vs Horse-power')
# Add the x-axis label
plt.xlabel('Horse-power')
# Add the y-axis label
plt.ylabel('Fuel efficiency (mpg)')
# Display the plot
# plt.show()
plt.savefig("_dummyPy056.png", bbox_inches="tight")
plt.clf()
# Make a list of the column names to be plotted: cols
cols = ["wt", "mpg"]
# Generate the box plots
df[cols].plot(kind="box", subplots=True)
# Display the plot
# plt.show()
plt.savefig("_dummyPy057.png", bbox_inches="tight")
plt.clf()
# Here, df is the tipping data from the Seaborn package, with emphasis on the column "fraction"
# Create a reasonable analog based on the pre-made CSV
tips = pd.read_csv(myPath + "tips.csv")
tips.sex = tips["sex"].astype("category")
tips.smoker = tips["smoker"].astype("category")
tips['total_bill'] = pd.to_numeric(tips["total_bill"], errors="coerce")
tips['tip'] = pd.to_numeric(tips["tip"], errors="coerce")
tips["fraction"] = tips["tip"] / tips["total_bill"]
df = tips.copy()
# This formats the plots such that they appear on separate rows
fig, axes = plt.subplots(nrows=2, ncols=1)
# Plot the PDF and CDF on the two axes
df.fraction.plot(ax=axes[0], kind='hist', bins=30, density=True, range=(0,.3))
df.fraction.plot(ax=axes[1], kind="hist", bins=30, density=True, cumulative=True, range=(0,.3))
# plt.show()
plt.savefig("_dummyPy058.png", bbox_inches="tight")
plt.clf()
# df is degrees by gender from http://nces.ed.gov/programs/digest/2013menu_tables.asp
# DO NOT HAVE DATASET - skip
# Print the minimum value of the Engineering column
# print(df["Engineering"].min())
# Print the maximum value of the Engineering column
# print(df["Engineering"].max())
# Construct the mean percentage per year: mean
# mean = df.mean(axis="columns")
# Plot the average percentage per year
# mean.plot()
# Display the plot
# plt.show()
# Now, df appears to be the Titanic dataset (not the table)
df = pd.read_csv(myPath + "titanic.csv")
# Print summary statistics of the fare column with .describe()
print(df["Fare"].describe())
# Generate a box plot of the fare column
df["Fare"].plot(kind="box")
# Show the plot
# plt.show()
plt.savefig("_dummyPy059.png", bbox_inches="tight")
plt.clf()
# Now, df is the life-expectancy Gapminder data as 260x219
# Needs the encoding to load
df = pd.read_csv(myPath + "gapminder.csv", encoding="latin-1", index_col=0).pivot_table(index="country", columns="year", values="life_expectancy")
# Print the number of countries reported in 2015
print(df[2015].count())
# Print the 5th and 95th percentiles
print(df.quantile([0.05, 0.95]))
# Generate a box plot
years = [1800, 1850, 1900, 1950, 2000]
df[years].plot(kind='box')
# plt.show()
plt.savefig("_dummyPy060.png", bbox_inches="tight")
plt.clf()
# Now, df is Pittsburgh weather data from https://www.wunderground.com/history/
# NEED TO GET THIS DATA
# january and march are both 31x2 with the columns being Date-Temperature
df = pd.read_csv(myPath + "KPIT_Temps_Small.csv")
january = df[["Date", "jan"]]
march = df[["Date", "mar"]]
# Print the mean of the January and March data
print(january.mean(), "\n", march.mean())
# Print the standard deviation of the January and March data
print(january.std(), "\n", march.std())
# Here, df is again automobile data of shape (392, 9)
# NEED TO GET THIS DATA - using MASS::Cars93 instead
tempDF = pd.read_csv(myPath + "Cars93.csv")
tempDF["Origin"]
df = tempDF[["Origin", "MPG.city", "MPG.highway", "Weight", "Horsepower"]]
# Compute the global mean and global standard deviation: global_mean, global_std
global_mean = df.mean(numeric_only=True)
global_std = df.std(numeric_only=True)
# Filter the US population from the origin column: us
us = df.loc[df["Origin"] == "USA", :]
# Compute the US mean and US standard deviation: us_mean, us_std
us_mean = us.mean(numeric_only=True)
us_std = us.std(numeric_only=True)
# Print the differences
print(us_mean - global_mean)
print(us_std - global_std)
# titanic is 1309x14 of data from the titanic
titanic = pd.read_csv(myPath + "titanic.csv", index_col=0)
# Display the box plots on 3 separate rows and 1 column
fig, axes = plt.subplots(nrows=3, ncols=1)
# Generate a box plot of the fare prices for the First passenger class
titanic.loc[titanic['Pclass'] == 1].plot(ax=axes[0], y='Fare', kind='box')
# Generate a box plot of the fare prices for the Second passenger class
titanic.loc[titanic['Pclass'] == 2].plot(ax=axes[1], y='Fare', kind='box')
# Generate a box plot of the fare prices for the Third passenger class
titanic.loc[titanic['Pclass'] == 3].plot(ax=axes[2], y='Fare', kind='box')
# Display the plot
# plt.show()
plt.savefig("_dummyPy061.png", bbox_inches="tight")
plt.clf()
## count 891.000000
## mean 32.204208
## std 49.693429
## min 0.000000
## 25% 7.910400
## 50% 14.454200
## 75% 31.000000
## max 512.329200
## Name: Fare, dtype: float64
## 208
## year 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 \
## 0.05 25.40 25.30 25.20 25.2 25.2 25.40 25.40 25.40 25.3 25.3
## 0.95 37.92 37.35 38.37 38.0 38.3 38.37 38.37 38.37 38.0 38.0
##
## year ... 2007 2008 2009 2010 2011 2012 2013 2014 \
## 0.05 ... 53.07 53.60 54.235 54.935 55.97 56.335 56.705 56.87
## 0.95 ... 80.73 80.93 81.200 81.365 81.60 81.665 81.830 82.00
##
## year 2015 2016
## 0.05 57.855 59.2555
## 0.95 82.100 82.1650
##
## [2 rows x 217 columns]
## Date 16.000000
## jan 26.096774
## dtype: float64
## Date 16.000000
## mar 43.612903
## dtype: float64
## Date 9.092121
## jan 10.514608
## dtype: float64
## Date 9.092121
## mar 8.503636
## dtype: float64
## MPG.city -1.407258
## MPG.highway -0.940188
## Weight 122.409274
## Horsepower 3.692876
## dtype: float64
## MPG.city -1.625356
## MPG.highway -1.180389
## Weight -24.668815
## Horsepower 2.080330
## dtype: float64
Maximum Stock Price by Month:
MPG vs HP (sized by Cylinders):
Box Plots for Weight and MPG (mtcars):
PDF and CDF for Tip as Percentage of Total Bill:
Box Plots for Titanic Fares:
Box Plot for Life Expectancy by Country (Gapminder):
Titanic Fares by Class (First, Second, Third):
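To see what the normalized and cumulative histograms of the tip fraction represent, the same quantities can be computed directly with NumPy. This is a sketch on randomly generated stand-in values (an assumption, not the tips data), using the same 30 bins over the range 0 to 0.3:

```python
import numpy as np

# Hypothetical tip fractions between 5% and 25%
rng = np.random.default_rng(0)
fraction = rng.uniform(0.05, 0.25, size=500)

# density=True scales bar heights so that the total bar area is 1 (an estimated PDF)
density, edges = np.histogram(fraction, bins=30, range=(0, 0.3), density=True)
widths = np.diff(edges)

# Accumulating area bin by bin gives the estimated CDF, which ends at 1
cdf = np.cumsum(density * widths)

print((density * widths).sum())
print(cdf[-1])
```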
Chapter 3 - Time series in pandas
Indexing pandas time series - dates and times are stored in datetime objects:
Resampling pandas time series - taking statistical measures over different time intervals:
Manipulating pandas time series - changing the data in one or more columns:
Visualizing pandas time series - additional plotting techniques such as line types, plot types, and sub-plots:
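The indexing, resampling, and rolling-window ideas above can be sketched on a made-up hourly series (three days of steadily increasing values, an assumption chosen so the results are easy to check by eye):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly temperatures for three days in August 2010
index = pd.date_range("2010-08-01", periods=72, freq=pd.Timedelta(hours=1))
temps = pd.Series(np.arange(72, dtype=float), index=index)

# Partial-string indexing: every observation on one calendar day
one_day = temps.loc["2010-08-02"]

# Downsample to daily data, aggregating by max
daily_highs = temps.resample("D").max()

# Rolling mean over a 24-observation window (NaN until the window fills)
smoothed = temps.rolling(window=24).mean()

print(one_day.head())
print(daily_highs)
print(smoothed.iloc[22:25])
```

The leading NaN values in the rolling result are expected: the first complete 24-observation window only exists at the 24th row.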
Example code includes:
myPath = "./PythonInputFiles/"
import pandas as pd
import matplotlib.pyplot as plt
# GREAT data is available at https://mesonet.agron.iastate.edu/request/download.phtml?network=IL_ASOS
# Downloaded KORD data from 2010 to myPath + "KORD_2010_from_IAState.txt"
# First 5 rows are commented, the sixth row is the header, and the next 10,443 rows are the data
# Load the file
tmpORD = pd.read_csv(myPath + "KORD_2010_from_IAState.txt", header=5)
tmpORD.columns = tmpORD.columns.str.strip()
isMETAR = tmpORD.loc[:, "valid"].str.contains(":51") # KORD METAR are taken at xx:51
useORD = tmpORD.loc[isMETAR, :] # ends as 8709 x 22, probably the METAR check missed a few at "off" times
date_list = useORD["valid"]
temperature_list = list(useORD["tmpf"])
# This is 8,759 temperature observations reflecting 20100101 00:00 through 20101231 23:00 on an hourly basis
# Prepare a format string: time_format
time_format = '%Y-%m-%d %H:%M'
# Convert date_list into a datetime object: my_datetimes
my_datetimes = pd.to_datetime(date_list, format=time_format)
# Construct a pandas Series using temperature_list and my_datetimes: time_series
# Something to explore later - this produced all np.nan if temperature_list were already a Series
ts0 = pd.Series(temperature_list, index=my_datetimes)
# Extract the hour from 9pm to 10pm on '2010-10-11': ts1
ts1 = ts0.loc['2010-10-11 20:51:00']
# Extract '2010-07-04' from ts0: ts2
ts2 = ts0.loc["2010-07-04"]
# Extract data from '2010-12-15' to '2010-12-31': ts3
ts3 = ts0.loc["2010-12-15":"2010-12-31"]
# Reindex without fill method: ts3
ts3 = ts2.reindex(ts0.index)
# Reindex with fill method, using forward fill: ts4
ts4 = ts2.reindex(ts0.index, method="ffill")
# Combine ts1 + ts2: sum12
sum12 = ts1 + ts2
# Combine ts1 + ts3: sum13
sum13 = ts1 + ts3
# Combine ts1 + ts4: sum14
sum14 = ts1 + ts4
# Still working with the temperature data, now renamed as df [technically, same index but containing Temperature-Dew Point-Pressure]
df = useORD[["tmpf", "dwpf", "alti"]]
df.index = my_datetimes
df.columns = ["Temperature", "DewPoint", "Pressure"]
saveWeather = df.copy()
# Downsample to 6 hour data and aggregate by mean: df1
df1 = df["Temperature"].resample("6H").mean()
# Downsample to daily data and count the number of data points: df2
df2 = df["Temperature"].resample("D").count()
# Extract temperature data for August: august
august = df.loc["2010-08", "Temperature"]
# Downsample to obtain only the daily highest temperatures in August: august_highs
august_highs = august.resample("D").max()
# Extract temperature data for February: february
february = df.loc["2010-02", "Temperature"]
# Downsample to obtain the daily lowest temperatures in February: february_lows
february_lows = february.resample("D").min()
# Extract data from 2010-Aug-01 to 2010-Aug-15: unsmoothed
unsmoothed = df['Temperature']["2010-08-01":"2010-08-15"]
# Apply a rolling mean with a 24 hour window: smoothed
smoothed = unsmoothed.rolling(window=24).mean()
# Create a new DataFrame with columns smoothed and unsmoothed: august
august = pd.DataFrame({'smoothed':smoothed, 'unsmoothed':unsmoothed})
# Plot both smoothed and unsmoothed data using august.plot().
august.plot()
# plt.show()
plt.savefig("_dummyPy062.png", bbox_inches="tight")
plt.clf()
# Extract the August 2010 data: august
august = df['Temperature']["2010-08"]
# Resample to daily data, aggregating by max: daily_highs
daily_highs = august.resample("D").max()
# Use a rolling 7-day window with method chaining to smooth the daily high temperatures in August
daily_highs_smoothed = daily_highs.rolling(window=7).mean()
print(daily_highs_smoothed)
# Plot the summer data
df = saveWeather.copy()
df.Temperature["2010-Jun":"2010-Aug"].plot()
# plt.show()
plt.savefig("_dummyPy063.png", bbox_inches="tight")
plt.clf()
# Plot the one week data
df.Temperature['2010-06-10':'2010-06-17'].plot()
# plt.show()
plt.savefig("_dummyPy064.png", bbox_inches="tight")
plt.clf()
# Now, df is 1741x17 of airline/airport data
# Saved the June 2011 data from hflights::hflights to csv
dfJun = pd.read_csv(myPath + "junFlights.csv")
dfJun["useMonth"] = ["{0:0>2}".format(x) for x in dfJun["Month"]]
dfJun["useDate"] = ["{0:0>2}".format(x) for x in dfJun["DayofMonth"]]
keyDates = dfJun["Year"].astype(str) + dfJun["useMonth"] + dfJun["useDate"]
time_format = '%Y%m%d'
useDates = pd.to_datetime(keyDates, format=time_format)
dfJun.index = useDates
df = dfJun[["DayOfWeek", "Dest", "DepTime", "ArrTime", "UniqueCarrier", "FlightNum"]]
df.columns = ["Weekday", "Destination Airport", "Wheels-off Time", "Arrival Time", "Carrier", "Flight"]
# Strip extra whitespace from the column names: df.columns
df.columns = df.columns.str.strip()
# Extract data for which the destination airport is Dallas: dallas
dallas = df['Destination Airport'].str.contains("DAL")
# Compute the total number of Dallas departures each day: daily_departures
daily_departures = dallas.resample("D").sum()
# Generate the summary statistics for daily Dallas departures: stats
stats = daily_departures.describe()
print(stats)
# Reset the index of ts2 to ts1, and then use linear interpolation to fill in the NaNs: ts2_interp
# ts2_interp = ts2.reindex(ts1.index).interpolate("linear")
# Compute the absolute difference of ts1 and ts2_interp: differences
# differences = np.abs(ts2_interp - ts1)
# Generate and print summary statistics of the differences
# print(differences.describe())
# Build a Boolean mask to filter out all the 'LAX' departure flights: mask
import numpy as np
mask = df['Destination Airport'] == "LAX"
# Use the mask to subset the data: la
la = df[mask].dropna()
la["Date"] = la.index.astype(str)
la["Wheel Time"] = ["{0:0>4}".format(int(x)) for x in la["Wheels-off Time"]]
# Combine two columns of data to create a datetime series: times_tz_none
times_tz_none = pd.to_datetime(la["Date"] + " " + la["Wheel Time"])
# Localize the time to US/Central: times_tz_central
times_tz_central = times_tz_none.dt.tz_localize("US/Central")
# Convert the datetimes from US/Central to US/Pacific
times_tz_pacific = times_tz_central.dt.tz_convert("US/Pacific")
newDF = pd.DataFrame( {"Date":keyDates, "Carrier":list(df["Carrier"]), "nFlight":1} )
useCarrier = [x in ["XE", "CO", "WN", "OO"] for x in newDF["Carrier"]]
useDF = newDF.loc[useCarrier].pivot_table(index="Date", columns=["Carrier"], values=["nFlight"], aggfunc=sum)
# Plot the raw data before setting the datetime index
useDF.plot()
# plt.show()
plt.savefig("_dummyPy065.png", bbox_inches="tight")
plt.clf()
# Convert the 'Date' column into a collection of datetime objects: df.Date
useDF["Date"] = pd.to_datetime(useDF.index)
# Set the index to be the converted 'Date' column
useDF.set_index("Date", inplace=True) # inplace=True makes the conversion in place; no need to reassign
# Re-plot the DataFrame to see that the axis is now datetime aware!
useDF.plot()
# plt.show()
plt.savefig("_dummyPy066.png", bbox_inches="tight")
plt.clf()
## valid
## 2010-08-01 NaN
## 2010-08-02 NaN
## 2010-08-03 NaN
## 2010-08-04 NaN
## 2010-08-05 NaN
## 2010-08-06 NaN
## 2010-08-07 83.094286
## 2010-08-08 83.402857
## 2010-08-09 84.122857
## 2010-08-10 84.560000
## 2010-08-11 85.434286
## 2010-08-12 86.591429
## 2010-08-13 88.160000
## 2010-08-14 88.880000
## 2010-08-15 88.288571
## 2010-08-16 87.157143
## 2010-08-17 85.588571
## 2010-08-18 84.585714
## 2010-08-19 84.020000
## 2010-08-20 84.020000
## 2010-08-21 83.711429
## 2010-08-22 83.428571
## 2010-08-23 83.145714
## 2010-08-24 83.865714
## 2010-08-25 83.300000
## 2010-08-26 82.014286
## 2010-08-27 81.165714
## 2010-08-28 81.602857
## 2010-08-29 83.454286
## 2010-08-30 84.868571
## 2010-08-31 86.437143
## Freq: D, Name: Temperature, dtype: float64
## count 30.00000
## mean 26.30000
## std 4.05267
## min 17.00000
## 25% 25.75000
## 50% 28.00000
## 75% 28.00000
## max 30.00000
## Name: Destination Airport, dtype: float64
Chicago Temperatures (KORD) - August 2010:
Chicago Temperatures (KORD) - Summer 2010:
Chicago Temperatures (KORD) - June 10-17, 2010:
Flights per Day (Top 4 Carriers) - Houston, June 2011:
Index Formatted as Date-Time rather than String:
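The code comments above note that building the time series produced all NaN when temperature_list was already a Series. A likely explanation is label alignment: when pd.Series() receives a Series plus a new index, it reindexes the data to the new labels instead of assigning by position. A sketch with hypothetical values:

```python
import pandas as pd

values = pd.Series([50.0, 51.0, 52.0])  # default integer labels 0, 1, 2
dates = pd.to_datetime(["2010-01-01", "2010-01-02", "2010-01-03"])

# Series + new index: aligned on labels; no date matches 0, 1, 2, so all NaN
aligned = pd.Series(values, index=dates)

# Plain values (list or NumPy array) + index: assigned by position
positional = pd.Series(values.to_numpy(), index=dates)

print(aligned)
print(positional)
```

Converting to a list, as the code above does, or to a NumPy array drops the old labels, so the values are attached to the datetime index in order.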
Chapter 4 - Case Study - Sunlight in Austin
Reading and cleaning the data - messy weather and climate data for Austin:
Statistical exploratory data analysis - slicing time series and the like:
Visual exploratory data analysis - histograms, line plots, box plots, and the like:
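The cleaning steps for this case study (pad the time, combine date and time, convert to datetime, set the index, coerce text to numbers) can be sketched on a tiny hypothetical frame. The column names follow the course data, but the three rows and the "M" missing-value marker are assumptions:

```python
import pandas as pd

# Hypothetical raw records: integer date, integer time without leading zeros, text temperatures
raw = pd.DataFrame({
    "date": [20110101, 20110101, 20110101],
    "Time": [53, 153, 1253],
    "dry_bulb_faren": ["51", "M", "60"],
})

# Convert the date to string and pad the time to four digits
date_str = raw["date"].astype(str)
time_str = raw["Time"].apply(lambda x: "{:0>4}".format(x))

# Combine and parse into datetimes, then use them as the index
date_times = pd.to_datetime(date_str + time_str, format="%Y%m%d%H%M")
clean = raw.set_index(date_times)

# errors="coerce" turns unparseable entries such as "M" into NaN
clean["dry_bulb_faren"] = pd.to_numeric(clean["dry_bulb_faren"], errors="coerce")

print(clean)
```

With a datetime index in place, time-based slices such as clean.loc["2011-01-01 01:00":"2011-01-01 13:00"] select rows by timestamp.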
Example code includes:
myPath = "./PythonInputFiles/"
# Import pandas
import pandas as pd
# GREAT data is available at https://mesonet.agron.iastate.edu/request/download.phtml?network=TX_ASOS
# Downloaded KAUS data from 2011 to myPath + "KAUS_2011_from_IAState.txt"
tmpAUS = pd.read_csv(myPath + "KAUS_2011_from_IAState.txt", header=5)
tmpAUS.columns = tmpAUS.columns.str.strip()
isMETAR = tmpAUS.loc[:, "valid"].str.contains(":53") # KAUS METAR are taken at xx:53
useAUS = tmpAUS.loc[isMETAR, :] # ends as 11,352 x 22, tons of duplicate METAR
useAUS = useAUS.drop_duplicates(subset=["valid"]) # ends as 8,432 x 22, some days with as few as 15 records
# First 5 rows are commented and the sixth row is the header (hence header=5 above)
# Read in the data file: df
# df = pd.read_csv("data.csv")
df = useAUS.copy()
df["date"] = [x.split()[0] for x in df["valid"]]
df["time"] = [x.split()[1] for x in df["valid"]]
df["StationType"] = "Airport"
df["sky_condition"] = df["skyc1"] + df["skyc2"] + df["skyc3"] + df["skyc4"]
# Print the output of df.head()
print(df.head())
# This is the column_labels list (my data is different - modify)
# column_labels = "Wban,date,Time,StationType,sky_condition,sky_conditionFlag,visibility,visibilityFlag,wx_and_obst_to_vision,wx_and_obst_to_visionFlag,dry_bulb_faren,dry_bulb_farenFlag,dry_bulb_cel,dry_bulb_celFlag,wet_bulb_faren,wet_bulb_farenFlag,wet_bulb_cel,wet_bulb_celFlag,dew_point_faren,dew_point_farenFlag,dew_point_cel,dew_point_celFlag,relative_humidity,relative_humidityFlag,wind_speed,wind_speedFlag,wind_direction,wind_directionFlag,value_for_wind_character,value_for_wind_characterFlag,station_pressure,station_pressureFlag,pressure_tendency,pressure_tendencyFlag,presschange,presschangeFlag,sea_level_pressure,sea_level_pressureFlag,record_type,hourly_precip,hourly_precipFlag,altimeter,altimeterFlag,junk"
# list_to_drop = ['sky_conditionFlag', 'visibilityFlag', 'wx_and_obst_to_vision', 'wx_and_obst_to_visionFlag', 'dry_bulb_farenFlag', 'dry_bulb_celFlag', 'wet_bulb_farenFlag', 'wet_bulb_celFlag', 'dew_point_farenFlag', 'dew_point_celFlag', 'relative_humidityFlag', 'wind_speedFlag', 'wind_directionFlag', 'value_for_wind_character', 'value_for_wind_characterFlag', 'station_pressureFlag', 'pressure_tendencyFlag', 'pressure_tendency', 'presschange', 'presschangeFlag', 'sea_level_pressureFlag', 'hourly_precip', 'hourly_precipFlag', 'altimeter', 'record_type', 'altimeterFlag', 'junk']
# Desired variables to be kept
# final_keep = ["Wban", "StationType", "date", "Time", "dry_bulb_faren", "dew_point_faren", "wet_bulb_faren", "dry_bulb_cel", "dew_point_cel", "wet_bulb_cel", "sky_condition", "station_pressure", "sea_level_pressure", "relative humidity", "wind_direction", "wind_speed", "visibility"]
final_keep = ["Wban", "StationType", "date", "Time", "dry_bulb_faren", "dew_point_faren", "sky_condition", "station_pressure", "sea_level_pressure", "relative humidity", "wind_direction", "wind_speed", "visibility"]
# Remove the appropriate columns: df_dropped
# df_dropped = df.drop(list_to_drop, axis="columns")
df_dropped = df.iloc[:, [0, 24, 22, 23, 2, 3, 25, 8, 9, 4, 5, 6, 10]]
df_dropped.columns = final_keep
# Print the output of df_dropped.head()
print(df_dropped.head())
print(df_dropped.shape)
# Convert the date column to string: df_dropped['date']
# df_dropped['date'] = df_dropped["date"].astype(str)
# Pad leading zeros to the Time column: df_dropped['Time']
# df_dropped['Time'] = df_dropped['Time'].apply(lambda x:'{:0>4}'.format(x))
# Concatenate the new date and Time columns: date_string
date_string = df_dropped['date'] + " " + df_dropped['Time']
# Convert the date_string Series to datetime: date_times
date_times = pd.to_datetime(date_string, format='%Y-%m-%d %H:%M')
# Set the index to be the new date_times container: df_clean
df_clean = df_dropped.set_index(date_times)
# Eliminate straggler record with index in 2010
is2011 = df_clean.index.year == 2011
df_clean = df_clean.loc[is2011, :]
# Print the output of df_clean.head()
print(df_clean.head())
print(df_clean.shape)
# Print the dry_bulb_faren temperature between 8 AM and 9 AM on June 20, 2011
print(df_clean.loc["2011-06-20 08:00:00":"2011-06-20 09:00:00", "dry_bulb_faren"])
# Convert the dry_bulb_faren column to numeric values: df_clean['dry_bulb_faren']
df_clean['dry_bulb_faren'] = pd.to_numeric(df_clean['dry_bulb_faren'], errors="coerce")
# Print the transformed dry_bulb_faren temperature between 8 AM and 9 AM on June 20, 2011
print(df_clean.loc["2011-06-20 08:00:00":"2011-06-20 09:00:00", "dry_bulb_faren"])
# Convert the wind_speed, dew_point_faren, and visibility columns to numeric values
df_clean['wind_speed'] = pd.to_numeric(df_clean['wind_speed'], errors="coerce")
df_clean['dew_point_faren'] = pd.to_numeric(df_clean['dew_point_faren'], errors="coerce")
df_clean['visibility'] = pd.to_numeric(df_clean['visibility'], errors="coerce")
# Print the median of the dry_bulb_faren column
print(df_clean["dry_bulb_faren"].median())
# Print the median of the dry_bulb_faren column for the time range '2011-Apr':'2011-Jun'
print(df_clean.loc["2011-04":"2011-06", 'dry_bulb_faren'].median())
# Print the median of the dry_bulb_faren column for the month of January
print(df_clean.loc["2011-01", 'dry_bulb_faren'].median())
# Downsample df_clean by day and aggregate by mean: daily_mean_2011
daily_mean_2011 = df_clean.resample("D").mean()
# Extract the dry_bulb_faren column from daily_mean_2011 using .values: daily_temp_2011
daily_temp_2011 = daily_mean_2011["dry_bulb_faren"].values
# NEED FILE!
# Downsample df_climate by day and aggregate by mean: daily_climate
# daily_climate = df_climate.resample("D").mean()
# Extract the Temperature column from daily_climate using .reset_index(): daily_temp_climate
# daily_temp_climate = daily_climate.reset_index()["Temperature"]
# Compute the difference between the two arrays and print the mean difference
# difference = daily_temp_2011 - daily_temp_climate
# print(difference.mean())
# Select days that are sunny: sunny
sunny = df_clean.loc[df_clean["sky_condition"].str.strip() == "CLR"]
# Select days that are overcast: overcast
overcast = df_clean.loc[df_clean["sky_condition"].str.contains("OVC")]
# Resample sunny and overcast, aggregating by maximum daily temperature
sunny_daily_max = sunny.resample("D").max()
overcast_daily_max = overcast.resample("D").max()
# Print the difference between the mean of sunny_daily_max and overcast_daily_max
print(sunny_daily_max.mean() - overcast_daily_max.mean())
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Select the visibility and dry_bulb_faren columns and resample them: weekly_mean
weekly_mean = df_clean[["visibility", "dry_bulb_faren"]].resample("W").mean()
# Print the output of weekly_mean.corr()
print(weekly_mean.corr())
# Plot weekly_mean with subplots=True
weekly_mean.plot(subplots=True)
# plt.show()
plt.savefig("_dummyPy067.png", bbox_inches="tight")
plt.clf()
# Create a Boolean Series for sunny days: sunny
sunny = df_clean["sky_condition"].str.strip() == "CLR"
# Resample the Boolean Series by day and compute the sum: sunny_hours
sunny_hours = sunny.resample("D").sum()
# Resample the Boolean Series by day and compute the count: total_hours
total_hours = sunny.resample("D").count()
# Divide sunny_hours by total_hours: sunny_fraction
sunny_fraction = sunny_hours / total_hours
# Make a box plot of sunny_fraction
sunny_fraction.plot(kind="box")
# plt.show()
plt.savefig("_dummyPy068.png", bbox_inches="tight")
plt.clf()
# Resample dew_point_faren and dry_bulb_faren by Month, aggregating the maximum values: monthly_max
monthly_max = df_clean[['dew_point_faren', 'dry_bulb_faren']].resample("M").max()
# Generate a histogram with bins=8, alpha=0.5, subplots=True
monthly_max.plot(kind="hist", bins=8, alpha=0.5, subplots=True)
# Show the plot
# plt.show()
plt.savefig("_dummyPy069.png", bbox_inches="tight")
plt.clf()
# Recall that df_climate is a separate dataset of the 1981-2010 data
# NEED DATASET
# Extract the maximum temperature in August 2010 from df_climate: august_max
# august_max = df_climate.loc["2010-Aug", "Temperature"].max()
# print(august_max)
# Resample the August 2011 temperatures in df_clean by day and aggregate the maximum value: august_2011
# august_2011 = df_clean.loc["2011-Aug", "dry_bulb_faren"].resample("D").max()
# Filter out days in august_2011 where the value exceeded august_max: august_2011_high
# august_2011_high = august_2011.loc[august_2011 > august_max]
# Construct a CDF of august_2011_high
# august_2011_high.plot(kind="hist", bins=25, density=True, cumulative=True)  # density replaces the removed matplotlib argument normed
# Display the plot
# plt.show()
## station valid tmpf dwpf relh drct sknt p01i alti \
## 0 AUS 2010-12-31 23:53 50.00 17.96 27.75 360.00 10.00 M 29.93
## 1 AUS 2011-01-01 00:53 51.08 15.08 23.54 360.00 13.00 M 29.95
## 2 AUS 2011-01-01 01:53 51.08 14.00 22.45 340.00 9.00 M 30.02
## 3 AUS 2011-01-01 02:53 51.08 12.92 21.41 10.00 13.00 M 30.02
## 4 AUS 2011-01-01 03:53 50.00 17.06 26.70 350.00 6.00 M 30.04
##
## mslp ... skyl1 skyl2 skyl3 skyl4 presentwx \
## 0 1013.20 ... 3900.00 M M M M
## 1 1014.20 ... 4500.00 M M M M
## 2 1016.20 ... 4900.00 M M M M
## 3 1016.20 ... 6000.00 M M M M
## 4 1017.00 ... 6500.00 M M M M
##
## metar date time \
## 0 KAUS 010553Z 36010KT 10SM BKN039 10/M08 A2993 ... 2010-12-31 23:53
## 1 KAUS 010653Z 36013KT 10SM OVC045 11/M09 A2995 ... 2011-01-01 00:53
## 2 KAUS 010753Z 34009KT 10SM OVC049 11/M10 A3002 ... 2011-01-01 01:53
## 3 KAUS 010853Z 01013KT 10SM OVC060 11/M11 A3002 ... 2011-01-01 02:53
## 4 KAUS 010953Z 35006KT 10SM OVC065 10/M08 A3004 ... 2011-01-01 03:53
##
## StationType sky_condition
## 0 Airport BKN
## 1 Airport OVC
## 2 Airport OVC
## 3 Airport OVC
## 4 Airport OVC
##
## [5 rows x 26 columns]
## Wban StationType date Time dry_bulb_faren dew_point_faren \
## 0 AUS Airport 2010-12-31 23:53 50.00 17.96
## 1 AUS Airport 2011-01-01 00:53 51.08 15.08
## 2 AUS Airport 2011-01-01 01:53 51.08 14.00
## 3 AUS Airport 2011-01-01 02:53 51.08 12.92
## 4 AUS Airport 2011-01-01 03:53 50.00 17.06
##
## sky_condition station_pressure sea_level_pressure relative humidity \
## 0 BKN 29.93 1013.20 27.75
## 1 OVC 29.95 1014.20 23.54
## 2 OVC 30.02 1016.20 22.45
## 3 OVC 30.02 1016.20 21.41
## 4 OVC 30.04 1017.00 26.70
##
## wind_direction wind_speed visibility
## 0 360.00 10.00 10.00
## 1 360.00 13.00 10.00
## 2 340.00 9.00 10.00
## 3 10.00 13.00 10.00
## 4 350.00 6.00 10.00
## (8432, 13)
## Wban StationType date Time dry_bulb_faren \
## 2011-01-01 00:53:00 AUS Airport 2011-01-01 00:53 51.08
## 2011-01-01 01:53:00 AUS Airport 2011-01-01 01:53 51.08
## 2011-01-01 02:53:00 AUS Airport 2011-01-01 02:53 51.08
## 2011-01-01 03:53:00 AUS Airport 2011-01-01 03:53 50.00
## 2011-01-01 04:53:00 AUS Airport 2011-01-01 04:53 50.00
##
## dew_point_faren sky_condition station_pressure \
## 2011-01-01 00:53:00 15.08 OVC 29.95
## 2011-01-01 01:53:00 14.00 OVC 30.02
## 2011-01-01 02:53:00 12.92 OVC 30.02
## 2011-01-01 03:53:00 17.06 OVC 30.04
## 2011-01-01 04:53:00 15.08 BKN 30.04
##
## sea_level_pressure relative humidity wind_direction \
## 2011-01-01 00:53:00 1014.20 23.54 360.00
## 2011-01-01 01:53:00 1016.20 22.45 340.00
## 2011-01-01 02:53:00 1016.20 21.41 10.00
## 2011-01-01 03:53:00 1017.00 26.70 350.00
## 2011-01-01 04:53:00 1017.20 24.50 20.00
##
## wind_speed visibility
## 2011-01-01 00:53:00 13.00 10.00
## 2011-01-01 01:53:00 9.00 10.00
## 2011-01-01 02:53:00 13.00 10.00
## 2011-01-01 03:53:00 6.00 10.00
## 2011-01-01 04:53:00 10.00 10.00
## (8431, 13)
## 2011-06-20 08:53:00 80.06
## Name: dry_bulb_faren, dtype: object
## 2011-06-20 08:53:00 80.06
## Name: dry_bulb_faren, dtype: float64
## 73.04
## 78.8
## 46.94
## dry_bulb_faren 6.827911
## dew_point_faren -3.915446
## station_pressure -0.002711
## wind_speed -2.321292
## visibility 0.174696
## dtype: float64
## visibility dry_bulb_faren
## visibility 1.000000 0.456775
## dry_bulb_faren 0.456775 1.000000
Mean Visibility and Temperature - Austin, TX 2011:
Percentage of Time with Clear Skies (CLR/SKC) by Day - Austin, TX 2011:
Histogram for Maximum Monthly Temperature and Dew Point - Austin, TX 2011:
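The numeric-coercion and resampling steps used in the case study above can be tried without the Austin data file. The following is a minimal sketch, assuming a small synthetic hourly frame; the name `obs` and all of its values are made up for illustration and are not taken from the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_clean: two days of hourly observations
idx = pd.Timestamp("2011-01-01") + pd.to_timedelta(np.arange(48), unit="h")
obs = pd.DataFrame(
    {
        # Strings with an "M" placeholder for missing values, as in the raw file
        "temp": ["50.0", "M"] * 24,
        "sky": ["CLR"] * 12 + ["OVC"] * 36,
    },
    index=idx,
)

# errors="coerce" turns unparseable entries such as "M" into NaN
obs["temp"] = pd.to_numeric(obs["temp"], errors="coerce")

# Downsample to one row per day; NaN values are skipped by mean()
daily_mean = obs["temp"].resample("D").mean()

# A Boolean Series sums to its number of True values, so sum / count
# gives the fraction of clear-sky observations per day
is_clear = obs["sky"] == "CLR"
clear_fraction = is_clear.resample("D").sum() / is_clear.resample("D").count()

print(daily_mean)
print(clear_fraction)
```

Dividing the daily sum by the daily count, rather than by a fixed 24, keeps the fraction meaningful on days with missing or extra observations.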
Chapter 1 - Extracting and transforming data
Indexing DataFrames - multiple ways to extract data from the pandas DataFrame:
Slicing DataFrames - different return types that come from indexing a pandas DataFrame:
Filtering DataFrames - general tool for selecting part of the data based on its properties rather than its indices (typically by way of Booleans):
Transforming DataFrames - best practice is to use built-in pandas methods where they exist, and otherwise to fall back on NumPy universal functions:
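The four ideas in the outline above (indexing, slicing, filtering, transforming) can be sketched on a toy frame. This is only an illustration under stated assumptions: the frame `df` and its numbers are hypothetical and merely resemble the election data loaded later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the course's election file
df = pd.DataFrame(
    {"total": [100, 250, 80], "share": [35.5, 56.6, 30.7]},
    index=["Adams", "Allegheny", "Armstrong"],
)

# Indexing: label-based (.loc) and position-based (.iloc) reach the same cell
by_label = df.loc["Allegheny", "share"]
by_position = df.iloc[1, 1]

# Slicing: a single column label returns a Series, a list of labels a DataFrame
as_series = df["total"]
as_frame = df[["total"]]

# Label slices with .loc include both endpoints
first_two = df.loc["Adams":"Allegheny"]

# Filtering: a Boolean Series selects the rows where the condition holds
majority = df[df["share"] > 50]

# Transforming: built-in pandas methods and NumPy ufuncs act on whole columns
in_dozens = df["total"].floordiv(12)
log_total = np.log(df["total"])

print(by_label == by_position)
print(type(as_series).__name__, type(as_frame).__name__)
print(majority.index.tolist())
```

The distinction between `df["total"]` and `df[["total"]]` matters downstream, because a Series and a one-column DataFrame expose different methods and print differently.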
Example code includes:
myPath = "./PythonInputFiles/"
import pandas as pd
# NEED DATA FRAME election (67 x 8) - indexed by county with columns state (PA) - total - Obama - Romney - winner - voters - turnout - margin
# appears to be 2012 US general election data, with the Obama and Romney columns being percentages, total being total votes, and voters being registered voters
# Saved the DataCamp file to myPath + "PAElection_2012.csv"
electionPA = pd.read_csv(myPath + "PAElection_2012.csv", index_col="county")
election = electionPA.copy()
# Assign the row position of election.loc['Bedford']: x
x = 4
# Assign the column position of election['winner']: y
y = 4
# Print the boolean equivalence
print(election.iloc[x, y] == election.loc['Bedford', 'winner'])
# DO NOT RUN - downloaded to myPath + "PAElection_2012.csv" instead
# filename = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1650/datasets/pennsylvania2012.csv'
# election = pd.read_csv(filename, index_col='county')
# Create a separate dataframe with the columns ['winner', 'total', 'voters']: results
results = election[['winner', 'total', 'voters']]
# Print the output of results.head()
print(results.head())
# Slice the columns from the starting column to 'Obama': left_columns
left_columns = election.loc[:, :"Obama"]
# Print the output of left_columns.head()
print(left_columns.head())
# Slice the columns from 'Obama' to 'winner': middle_columns
middle_columns = election.loc[:, "Obama":"winner"]
# Print the output of middle_columns.head()
print(middle_columns.head())
# Slice the columns from 'Romney' to the end: 'right_columns'
right_columns = election.loc[:, "Romney":]
# Print the output of right_columns.head()
print(right_columns.head())
# Create the list of row labels: rows
rows = ['Philadelphia', 'Centre', 'Fulton']
# Create the list of column labels: cols
cols = ['winner', 'Obama', 'Romney']
# Create the new DataFrame: three_counties
three_counties = election.loc[rows, cols]
# Print the three_counties DataFrame
print(three_counties)
# Create a turnout column (votes cast as a percentage of registered voters)
election["turnout"] = 100 * election["total"] / election["voters"]
# Create the boolean array: high_turnout
high_turnout = election["turnout"] > 70
# Filter the election DataFrame with the high_turnout array: high_turnout_df
high_turnout_df = election[high_turnout]
# Print the high_turnout_df DataFrame
print(high_turnout_df)
# Import numpy
import numpy as np
# Create the election["margin"] column
election["margin"] = abs(election["Obama"] - election["Romney"])
# Create the boolean array: too_close
too_close = election["margin"] < 1
# Assign np.nan to the 'winner' column where the results were too close to call
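# The original chained form, election["winner"][too_close] = np.nan, raised the SettingWithCopyWarning shown in the output below; .loc assigns in a single step
election.loc[too_close, "winner"] = np.nan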
# Print the output of election.info()
print(election.info())
# NEED DATASET titanic (1309 x 14)
# User version saved previously
titanic = pd.read_csv(myPath + 'titanic.csv', index_col=0)
# Select the 'Age' and 'Cabin' columns: df
df = titanic[["Age", "Cabin"]]
# Print the shape of df
print(df.shape)
# Drop rows in df with how='any' and print the shape
print(df.dropna(how="any").shape)
# Drop rows in df with how='all' and print the shape
print(df.dropna(how="all").shape)
# Call .dropna() with thresh=500 (the course used 1000, but this file has only 891 rows) and axis='columns' and print the output of .info() from titanic
print(titanic.dropna(thresh=500, axis='columns').info())
# NEED DATASET weather which is 365 x 23 from Weather Underground, representing Pittsburgh weather data for 2013
# https://www.wunderground.com/history
# Use the KORD METAR data instead
# Load the file
tmpORD = pd.read_csv(myPath + "KORD_2010_from_IAState.txt", header=5)
tmpORD.columns = tmpORD.columns.str.strip()
isMETAR = tmpORD.loc[:, "valid"].str.contains(":51") # KORD METAR are taken at xx:51
useORD = tmpORD.loc[isMETAR, :] # ends as 8709 x 22, probably the METAR check missed a few at "off" times
date_list = useORD["valid"]
time_format = '%Y-%m-%d %H:%M'
my_datetimes = pd.to_datetime(date_list, format=time_format)
useORD.index = my_datetimes
# Just keep the temperature and dew point
weather = useORD[["tmpf", "dwpf"]]
weather.columns = ['Mean TemperatureF','Mean Dew PointF']
# Write a function to convert degrees Fahrenheit to degrees Celsius: to_celsius
def to_celsius(F):
    return 5/9*(F - 32)
# Apply the function over 'Mean TemperatureF' and 'Mean Dew PointF': df_celsius
df_celsius = weather[['Mean TemperatureF','Mean Dew PointF']].apply(to_celsius)
# Reassign the columns df_celsius
df_celsius.columns = ['Mean TemperatureC', 'Mean Dew PointC']
# Print the output of df_celsius.head()
print(df_celsius.head())
# Create the dictionary: red_vs_blue
red_vs_blue = {"Obama":"blue", "Romney":"red"}
# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election["winner"].map(red_vs_blue)
# Print the output of election.head()
print(election.head())
# Import zscore from scipy.stats
# Need to solve BLAS/LAPACK issue - cannot get scipy to download and install . . .
# from scipy.stats import zscore
import numpy as np
def zscore(x):
    mu = np.mean(x)
    sd = np.std(x)
    return((x - mu) / sd)
# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election["turnout"])
# Print the type of turnout_zscore
print(type(turnout_zscore))
# Assign turnout_zscore to a new column: election['turnout_zscore']
election["turnout_zscore"] = turnout_zscore
# Print the output of election.head()
print(election.head())
## -c:90: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame
##
## See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
## True
## winner total voters
## county
## Adams Romney 41973 61156
## Allegheny Obama 614671 924351
## Armstrong Romney 28322 42147
## Beaver Romney 80015 115157
## Bedford Romney 21444 32189
## state total Obama
## county
## Adams PA 41973 35.482334
## Allegheny PA 614671 56.640219
## Armstrong PA 28322 30.696985
## Beaver PA 80015 46.032619
## Bedford PA 21444 22.057452
## Obama Romney winner
## county
## Adams 35.482334 63.112001 Romney
## Allegheny 56.640219 42.185820 Obama
## Armstrong 30.696985 67.901278 Romney
## Beaver 46.032619 52.637630 Romney
## Bedford 22.057452 76.986570 Romney
## Romney winner voters
## county
## Adams 63.112001 Romney 61156
## Allegheny 42.185820 Obama 924351
## Armstrong 67.901278 Romney 42147
## Beaver 52.637630 Romney 115157
## Bedford 76.986570 Romney 32189
## winner Obama Romney
## county
## Philadelphia Obama 85.224251 14.051451
## Centre Romney 48.948416 48.977486
## Fulton Romney 21.096291 77.748861
## state total Obama Romney winner voters turnout
## county
## Bucks PA 319407 49.966970 48.801686 Obama 435606 73.324748
## Butler PA 88924 31.920516 66.816607 Romney 122762 72.436096
## Chester PA 248295 49.228539 49.650617 Romney 337822 73.498766
## Forest PA 2308 38.734835 59.835355 Romney 3232 71.410891
## Franklin PA 62802 30.110506 68.583803 Romney 87406 71.850903
## Montgomery PA 401787 56.637223 42.286834 Obama 551105 72.905708
## Westmoreland PA 168709 37.567646 61.306154 Romney 238006 70.884347
## <class 'pandas.core.frame.DataFrame'>
## Index: 67 entries, Adams to York
## Data columns (total 8 columns):
## state 67 non-null object
## total 67 non-null int64
## Obama 67 non-null float64
## Romney 67 non-null float64
## winner 64 non-null object
## voters 67 non-null int64
## turnout 67 non-null float64
## margin 67 non-null float64
## dtypes: float64(4), int64(2), object(2)
## memory usage: 5.4+ KB
## None
## (891, 2)
## (185, 2)
## (733, 2)
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 891 entries, 1 to 891
## Data columns (total 11 columns):
## PassengerId 891 non-null int64
## Survived 891 non-null int64
## Pclass 891 non-null int64
## Name 891 non-null object
## Sex 891 non-null object
## Age 714 non-null float64
## SibSp 891 non-null int64
## Parch 891 non-null int64
## Ticket 891 non-null object
## Fare 891 non-null float64
## Embarked 889 non-null object
## dtypes: float64(2), int64(5), object(4)
## memory usage: 69.6+ KB
## None
## Mean TemperatureC Mean Dew PointC
## valid
## 2010-01-01 00:51:00 -9.4 -16.1
## 2010-01-01 01:51:00 -10.0 -16.1
## 2010-01-01 02:51:00 -11.1 -16.1
## 2010-01-01 03:51:00 -11.7 -16.7
## 2010-01-01 04:51:00 -12.2 -16.7
## state total Obama Romney winner voters turnout \
## county
## Adams PA 41973 35.482334 63.112001 Romney 61156 68.632677
## Allegheny PA 614671 56.640219 42.185820 Obama 924351 66.497575
## Armstrong PA 28322 30.696985 67.901278 Romney 42147 67.198140
## Beaver PA 80015 46.032619 52.637630 Romney 115157 69.483401
## Bedford PA 21444 22.057452 76.986570 Romney 32189 66.619031
##
## margin color
## county
## Adams 27.629667 red
## Allegheny 14.454399 blue
## Armstrong 37.204293 red
## Beaver 6.605012 red
## Bedford 54.929118 red
## <class 'pandas.core.series.Series'>
## state total Obama Romney winner voters turnout \
## county
## Adams PA 41973 35.482334 63.112001 Romney 61156 68.632677
## Allegheny PA 614671 56.640219 42.185820 Obama 924351 66.497575
## Armstrong PA 28322 30.696985 67.901278 Romney 42147 67.198140
## Beaver PA 80015 46.032619 52.637630 Romney 115157 69.483401
## Bedford PA 21444 22.057452 76.986570 Romney 32189 66.619031
##
## margin color turnout_zscore
## county
## Adams 27.629667 red 0.853734
## Allegheny 14.454399 blue 0.439846
## Armstrong 37.204293 red 0.575650
## Beaver 6.605012 red 1.018647
## Bedford 54.929118 red 0.463391
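The SettingWithCopyWarning at the top of the output above is the kind of message pandas emits for chained-indexing assignments of the form `df[col][mask] = value`. Below is a minimal sketch of the single-step `.loc` alternative. The frame `votes` and its numbers are hypothetical, chosen only to mimic the shape of the election data.

```python
import numpy as np
import pandas as pd

# Hypothetical vote shares; not the Pennsylvania data
votes = pd.DataFrame(
    {
        "Obama": [49.2, 56.6, 30.7],
        "Romney": [49.7, 42.2, 67.9],
        "winner": ["Romney", "Obama", "Romney"],
    },
    index=["Chester", "Allegheny", "Armstrong"],
)

# Boolean mask for races decided by less than one percentage point
too_close = (votes["Obama"] - votes["Romney"]).abs() < 1

# One .loc call selects rows and column together, so the assignment
# is applied to votes itself rather than to a temporary object
votes.loc[too_close, "winner"] = np.nan

print(votes)
```

Because the row mask and the column label go through one indexer, there is no intermediate object whose view-or-copy status could be ambiguous.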
Chapter 2 - Advanced Indexing
Index objects and labeled data - one of the key building blocks of the pandas data structures:
Hierarchical indexing - representing multi-dimensional index data:
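Both points in the outline above can be sketched briefly: Index objects cannot be modified element by element, and a sorted hierarchical index supports tuple and slice lookups. The data below are a hypothetical four-row variant of the `sales` example, deliberately entered out of order.

```python
import pandas as pd

# Hypothetical sales counts indexed by (state, month), deliberately unsorted
sales = pd.DataFrame(
    {"eggs": [77, 47, 221, 110], "spam": [20, 17, 72, 31]},
    index=pd.MultiIndex.from_tuples(
        [("NY", 2), ("CA", 1), ("NY", 1), ("CA", 2)], names=["state", "month"]
    ),
)

# Index objects are immutable: single entries cannot be overwritten in place
try:
    sales.index[0] = ("TX", 1)
except TypeError as err:
    print("TypeError:", err)

# Sorting the index first keeps label-based lookups on a MultiIndex well defined
sales = sales.sort_index()

# A tuple addresses one row; slice(None) means "every value of this level"
ny_month1 = sales.loc[("NY", 1)]
all_month2 = sales.loc[(slice(None), 2), :]

# pd.IndexSlice is an equivalent, more readable spelling of the same lookup
idx = pd.IndexSlice
same_month2 = sales.loc[idx[:, 2], :]

print(ny_month1)
print(all_month2)
```

Replacing the whole index (for example `sales.index = new_idx`) is still allowed; only in-place edits of individual labels are rejected.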
Example code includes:
myPath = "./PythonInputFiles/"
import pandas as pd
import numpy as np
sales = pd.DataFrame()
sales["eggs"] = [47, 110, 221, 77, 132, 205]
sales["salt"] = [12, 50, 89, 87, np.nan, 60]
sales["spam"] = [17, 31, 72, 20, 52, 55]
sales.index = ["jan", "feb", "mar", "apr", "may", "jun"]
# Create the list of new indexes: new_idx
new_idx = [x.upper() for x in sales.index]
# Assign new_idx to sales.index
sales.index = new_idx
# Print the sales DataFrame
print(sales)
# Assign the string 'MONTHS' to sales.index.name
sales.index.name = "MONTHS"
# Print the sales DataFrame
print(sales)
# Assign the string 'PRODUCTS' to sales.columns.name
sales.columns.name = "PRODUCTS"
# Print the sales dataframe again
print(sales)
# Generate the list of months: months
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
# Assign months to sales.index
sales.index = months
# Print the modified sales DataFrame
print(sales)
# NEED TO MODIFY sales so it is the same data but indexed as CA/1, CA/2, NY/1, NY/2, TX/1, TX/2 (using state-month)
sales = sales.set_index([["CA", "CA", "NY", "NY","TX", "TX"], [1, 2, 1, 2, 1, 2]])
# Print sales.loc[['CA', 'TX']]
print(sales.loc[['CA', 'TX']])
# Print sales['CA':'TX']
print(sales['CA':'TX'])
# Now, sales is again a non-indexed DataFrame with state and month as columns
# Set the index to be the columns ['state', 'month']: sales
states = [x for x, y in list(sales.index)]
months = [y for x, y in list(sales.index)]
sales.index = range(sales.shape[0])
sales["state"] = states
sales["month"] = months
oldSales = sales.copy()
sales = sales.set_index(['state', 'month'])
# Sort the MultiIndex: sales
sales = sales.sort_index(ascending=False)
# Print the sales DataFrame
print(sales)
multiSales = sales.copy()
# Go back to the sales as it was prior to indexing in the above step
# Set the index to the column 'state': sales
sales = oldSales.set_index(["state"])
# Print the sales DataFrame
print(sales)
# Access the data from 'NY'
print(sales.loc["NY"])
# Go back to sales as the Multi-Index dataset again . . .
sales = multiSales.copy()
sales = sales.sort_index(ascending=True) # Could not grab without error unless ascending=True
# Look up data for NY in month 1: NY_month1
NY_month1 = sales.loc[ ("NY", 1) ]
# Look up data for CA and TX in month 2: CA_TX_month2
CA_TX_month2 = sales.loc[ (["CA", "TX"], 2) , :]
# Look up data for all states in month 2: all_month2
all_month2 = sales.loc[ (slice(None), 2), :]
## eggs salt spam
## JAN 47 12.0 17
## FEB 110 50.0 31
## MAR 221 89.0 72
## APR 77 87.0 20
## MAY 132 NaN 52
## JUN 205 60.0 55
## eggs salt spam
## MONTHS
## JAN 47 12.0 17
## FEB 110 50.0 31
## MAR 221 89.0 72
## APR 77 87.0 20
## MAY 132 NaN 52
## JUN 205 60.0 55
## PRODUCTS eggs salt spam
## MONTHS
## JAN 47 12.0 17
## FEB 110 50.0 31
## MAR 221 89.0 72
## APR 77 87.0 20
## MAY 132 NaN 52
## JUN 205 60.0 55
## PRODUCTS eggs salt spam
## Jan 47 12.0 17
## Feb 110 50.0 31
## Mar 221 89.0 72
## Apr 77 87.0 20
## May 132 NaN 52
## Jun 205 60.0 55
## PRODUCTS eggs salt spam
## CA 1 47 12.0 17
## 2 110 50.0 31
## TX 1 132 NaN 52
## 2 205 60.0 55
## PRODUCTS eggs salt spam
## CA 1 47 12.0 17
## 2 110 50.0 31
## NY 1 221 89.0 72
## 2 77 87.0 20
## TX 1 132 NaN 52
## 2 205 60.0 55
## PRODUCTS eggs salt spam
## state month
## TX 2 205 60.0 55
## 1 132 NaN 52
## NY 2 77 87.0 20
## 1 221 89.0 72
## CA 2 110 50.0 31
## 1 47 12.0 17
## PRODUCTS eggs salt spam month
## state
## CA 47 12.0 17 1
## CA 110 50.0 31 2
## NY 221 89.0 72 1
## NY 77 87.0 20 2
## TX 132 NaN 52 1
## TX 205 60.0 55 2
## PRODUCTS eggs salt spam month
## state
## NY 221 89.0 72 1
## NY 77 87.0 20 2
Chapter 3 - Rearranging and Reshaping Data
Pivoting DataFrames - changing shapes to one that better suits analysis needs:
Stacking and unstacking DataFrames - the idea of moving variables to/from the index so that the columns match data needs:
Melting DataFrames - converting pivoted data back into a column format:
Pivot tables are needed when multiple rows share the same index/column combination (so a plain pivot would fail) - an aggregation function must specify how to combine the duplicates:
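The difference between pivoting and pivot tables, and the return trip through melting, can be sketched as follows. The frame is a hypothetical variant of the `users` data in which the Sunday/Austin combination appears twice on purpose.

```python
import pandas as pd

# Hypothetical long-format data; ("Sun", "Austin") is duplicated on purpose
users = pd.DataFrame(
    {
        "weekday": ["Sun", "Sun", "Sun", "Mon", "Mon"],
        "city": ["Austin", "Austin", "Dallas", "Austin", "Dallas"],
        "visitors": [100, 39, 237, 326, 456],
    }
)

# .pivot() refuses duplicate index/column pairs
try:
    users.pivot(index="weekday", columns="city", values="visitors")
except ValueError as err:
    print("pivot failed:", err)

# .pivot_table() resolves the duplicates with an aggregation function
wide = users.pivot_table(
    index="weekday", columns="city", values="visitors", aggfunc="sum"
)

# melt() converts the wide table back to one row per weekday/city pair
long = pd.melt(
    wide.reset_index(), id_vars=["weekday"], var_name="city", value_name="visitors"
)

print(wide)
print(long)
```

The melted result has one row per weekday/city pair rather than per original record, since the aggregation step has already merged the duplicates.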
Example code includes:
myPath = "./PythonInputFiles/"
import pandas as pd
users=pd.DataFrame()
users["weekday"] = ["Sun", "Sun", "Mon", "Mon"]
users["city"] = ["Austin", "Dallas", "Austin", "Dallas"]
users["visitors"] = [139, 237, 326, 456]
users["signups"] = [7, 12, 3, 5]
# Pivot the users DataFrame: visitors_pivot
visitors_pivot = users.pivot(index="weekday", columns="city", values="visitors")
# Print the pivoted DataFrame
print(visitors_pivot)
# Pivot users with signups indexed by weekday and city: signups_pivot
signups_pivot = users.pivot(index="weekday", columns="city", values="signups")
# Print signups_pivot
print(signups_pivot)
# Pivot users with both signups and visitors as values: pivot
pivot = users.pivot(index="weekday", columns="city")
# Print the pivoted DataFrame
print(pivot)
a = users.set_index(["city", "weekday"])
users = a.sort_index()
# Unstack users by 'weekday': byweekday
byweekday = users.unstack(level="weekday")
# Print the byweekday DataFrame
print(byweekday)
# Stack byweekday by 'weekday' and print it
print(byweekday.stack(level="weekday"))
# Unstack users by 'city': bycity
bycity = users.unstack(level="city")
# Print the bycity DataFrame
print(bycity)
# Stack bycity by 'city' and print it
print(bycity.stack(level="city"))
# Stack 'city' back into the index of bycity: newusers
newusers = bycity.stack(level="city")
# Swap the levels of the index of newusers: newusers
newusers = newusers.swaplevel(0, 1)
# Print newusers and verify that the index is not sorted
print(newusers)
# Sort the index of newusers: newusers
newusers = newusers.sort_index()
# Print newusers and verify that the index is now sorted
print(newusers)
# Verify that the new DataFrame is equal to the original
print(newusers.equals(users))
visitors_by_city_weekday = users[["visitors"]].unstack(level="city").reset_index()
visitors_by_city_weekday.columns = ["weekday", "Austin", "Dallas"]
# Reset the index: visitors_by_city_weekday
# visitors_by_city_weekday = visitors_by_city_weekday.reset_index() # this needed to be done above to get the column names right . . .
# Print visitors_by_city_weekday
print(visitors_by_city_weekday)
# Melt visitors_by_city_weekday: visitors
visitors = pd.melt(visitors_by_city_weekday, id_vars=["weekday"], value_name="visitors", var_name="city")
# Print visitors
print(visitors)
users=pd.DataFrame()
users["weekday"] = ["Sun", "Sun", "Mon", "Mon"]
users["city"] = ["Austin", "Dallas", "Austin", "Dallas"]
users["visitors"] = [139, 237, 326, 456]
users["signups"] = [7, 12, 3, 5]
# Melt users: skinny
skinny = pd.melt(users, id_vars = ["weekday", "city"], value_vars=["visitors", "signups"])
# Print skinny
print(skinny)
# Set the new index: users_idx
users_idx = users.set_index(['city', 'weekday'])
# Print the users_idx DataFrame
print(users_idx)
# Obtain the key-value pairs: kv_pairs
kv_pairs = pd.melt(users_idx, col_level=0)
# Print the key-value pairs
print(kv_pairs)
# Create the DataFrame with the appropriate pivot table: by_city_day
by_city_day = users.pivot_table(index="weekday", columns="city")
# Print by_city_day
print(by_city_day)
# Use a pivot table to display the count of each column: count_by_weekday1
count_by_weekday1 = users.pivot_table(index="weekday", aggfunc="count")
# Print count_by_weekday1
print(count_by_weekday1)
# Replace 'aggfunc='count'' with 'aggfunc=len': count_by_weekday2
count_by_weekday2 = users.pivot_table(index="weekday", aggfunc=len)
# Verify that the same result is obtained
print('==========================================')
print(count_by_weekday1.equals(count_by_weekday2))
# Create the DataFrame with the appropriate pivot table: signups_and_visitors
signups_and_visitors = users.pivot_table(index="weekday", aggfunc=sum)
# Print signups_and_visitors
print(signups_and_visitors)
# Add in the margins: signups_and_visitors_total
signups_and_visitors_total = users.pivot_table(index="weekday", aggfunc=sum, margins=True)
# Print signups_and_visitors_total
print(signups_and_visitors_total)
## city Austin Dallas
## weekday
## Mon 326 456
## Sun 139 237
## city Austin Dallas
## weekday
## Mon 3 5
## Sun 7 12
## visitors signups
## city Austin Dallas Austin Dallas
## weekday
## Mon 326 456 3 5
## Sun 139 237 7 12
## visitors signups
## weekday Mon Sun Mon Sun
## city
## Austin 326 139 3 7
## Dallas 456 237 5 12
## visitors signups
## city weekday
## Austin Mon 326 3
## Sun 139 7
## Dallas Mon 456 5
## Sun 237 12
## visitors signups
## city Austin Dallas Austin Dallas
## weekday
## Mon 326 456 3 5
## Sun 139 237 7 12
## visitors signups
## weekday city
## Mon Austin 326 3
## Dallas 456 5
## Sun Austin 139 7
## Dallas 237 12
## visitors signups
## city weekday
## Austin Mon 326 3
## Dallas Mon 456 5
## Austin Sun 139 7
## Dallas Sun 237 12
## visitors signups
## city weekday
## Austin Mon 326 3
## Sun 139 7
## Dallas Mon 456 5
## Sun 237 12
## True
## weekday Austin Dallas
## 0 Mon 326 456
## 1 Sun 139 237
## weekday city visitors
## 0 Mon Austin 326
## 1 Sun Austin 139
## 2 Mon Dallas 456
## 3 Sun Dallas 237
## weekday city variable value
## 0 Sun Austin visitors 139
## 1 Sun Dallas visitors 237
## 2 Mon Austin visitors 326
## 3 Mon Dallas visitors 456
## 4 Sun Austin signups 7
## 5 Sun Dallas signups 12
## 6 Mon Austin signups 3
## 7 Mon Dallas signups 5
## visitors signups
## city weekday
## Austin Sun 139 7
## Dallas Sun 237 12
## Austin Mon 326 3
## Dallas Mon 456 5
## variable value
## 0 visitors 139
## 1 visitors 237
## 2 visitors 326
## 3 visitors 456
## 4 signups 7
## 5 signups 12
## 6 signups 3
## 7 signups 5
## signups visitors
## city Austin Dallas Austin Dallas
## weekday
## Mon 3 5 326 456
## Sun 7 12 139 237
## city signups visitors
## weekday
## Mon 2 2 2
## Sun 2 2 2
## ==========================================
## True
## signups visitors
## weekday
## Mon 8 782
## Sun 19 376
## signups visitors
## weekday
## Mon 8.0 782.0
## Sun 19.0 376.0
## All 27.0 1158.0
Chapter 4 - Grouping data
Categoricals and groupby - using the .groupby() method and then chaining various commands to it:
Groupby and aggregation - running multiple calculations after the split and before the combine:
Groupby and transformation - applying different transformations to different groups:
Groupby and filtering - filtering groups prior to aggregating:
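The three groupby patterns in the outline above differ mainly in the shape of what they return. This is a minimal sketch on a hypothetical five-row frame whose column names echo the titanic data but whose values are invented.

```python
import pandas as pd

# Hypothetical passenger-style data with one missing age
df = pd.DataFrame(
    {
        "pclass": [1, 1, 2, 2, 3],
        "age": [40.0, None, 30.0, 20.0, 25.0],
        "fare": [80.0, 60.0, 20.0, 30.0, 8.0],
    }
)
by_class = df.groupby("pclass")

# Aggregation: one row per group, several statistics per column
summary = by_class[["age", "fare"]].agg(["max", "median"])

# Transformation: the result keeps the original row count and index
age_filled = by_class["age"].transform(lambda s: s.fillna(s.median()))

# Filtering: keep only the rows belonging to groups that satisfy a condition
multi_member = by_class.filter(lambda g: len(g) > 1)

print(summary)
print(age_filled.tolist())
print(multi_member)
```

Aggregation collapses each group to one row, transformation returns a value for every original row, and filtering returns a subset of the original rows.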
Example code includes:
myPath = "./PythonInputFiles/"
# Need to bring in "titanic" (1309 x 14)
import pandas as pd
titanic = pd.read_csv(myPath + 'titanic.csv', index_col=0)
titanic.columns = ['id', 'survived', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked']
# titanic.columns = ['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']
# Group titanic by 'pclass'
by_class = titanic.groupby("pclass")
# Aggregate 'survived' column of by_class by count
count_by_class = by_class["survived"].count()
# Print count_by_class
print(count_by_class)
# Group titanic by 'embarked' and 'pclass'
by_mult = titanic.groupby(["embarked", "pclass"])
# Aggregate 'survived' column of by_mult by count
count_mult = by_mult["survived"].count()
# Print count_mult
print(count_mult)
# Saved to myPath as lifeSaved.csv and regionsSaved.csv
# life_f = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1650/datasets/life_expectancy.csv'
# regions_f = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1650/datasets/regions.csv'
life = pd.read_csv(myPath + "lifeSaved.csv", index_col='Country', encoding="latin-1")
regions = pd.read_csv(myPath + "regionsSaved.csv", index_col='Country', encoding="latin-1")
# Group life by regions['region']: life_by_region
life_by_region = life.groupby(regions["region"])
# Print the mean over the '2010' column of life_by_region
print(life_by_region["2010"].mean())
# Again using the titanic dataset (same as above)
# Group titanic by 'pclass': by_class
by_class = titanic.groupby("pclass")
# Select 'age' and 'fare'
by_class_sub = by_class[['age','fare']]
# Aggregate by_class_sub by 'max' and 'median': aggregated
aggregated = by_class_sub.agg(["max", "median"])
# Print the maximum age in each class
print(aggregated.loc[:, ('age','max')])
# Print the median fare in each class
print(aggregated.loc[:, ('fare', 'median')])
# Read the CSV file into a DataFrame and sort the index: gapminder
# NEED FILE!
# gapminder = pd.read_csv("gapminder.csv", index_col=['Year','region','Country']).sort_index()
# Group gapminder by 'Year' and 'region': by_year_region
# by_year_region = gapminder.groupby(level=["Year", "region"])
# Define the function to compute spread: spread
# def spread(series):
#     return series.max() - series.min()
# Create the dictionary: aggregator
# aggregator = {'population':'sum', 'child_mortality':'mean', 'gdp':spread}
# Aggregate by_year_region using the dictionary: aggregated
# aggregated = by_year_region.agg(aggregator)
# Print the last 6 entries of aggregated
# print(aggregated.tail(6))
# NEED FILE
# Read file: sales
# sales = pd.read_csv("sales.csv", index_col="Date", parse_dates=True)
# Create a groupby object: by_day
# by_day = sales.groupby(sales.index.strftime('%a'))
# Create sum: units_sum
# units_sum = by_day.sum()
# Print units_sum
# print(units_sum)
# Import zscore
# from scipy.stats import zscore
# Group gapminder_2010: standardized
# standardized = gapminder_2010.groupby("region")[['life','fertility']].transform(zscore)
# Construct a Boolean Series to identify outliers: outliers
# outliers = (standardized['life'] < -3) | (standardized['fertility'] > 3)
# Filter gapminder_2010 by the outliers: gm_outliers
# gm_outliers = gapminder_2010.loc[outliers]
# Print gm_outliers
# print(gm_outliers)
# Create a groupby object: by_sex_class
by_sex_class = titanic.groupby(["sex", "pclass"])
# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())
# Impute age and assign to titanic['age']
titanic.age = by_sex_class["age"].transform(impute_median)
# Print the output of titanic.tail(10)
print(titanic.tail(10))
def disparity(gr):
    # Compute the spread of gr['gdp']: s
    s = gr['gdp'].max() - gr['gdp'].min()
    # Compute the z-score of gr['gdp'] as (gr['gdp']-gr['gdp'].mean())/gr['gdp'].std(): z
    z = (gr['gdp'] - gr['gdp'].mean())/gr['gdp'].std()
    # Return a DataFrame with the inputs {'z(gdp)':z, 'regional spread(gdp)':s}
    return pd.DataFrame({'z(gdp)':z , 'regional spread(gdp)':s})
# NEED FILE!
# Group gapminder_2010 by 'region': regional
# regional = gapminder_2010.groupby("region")
# Apply the disparity function on regional: reg_disp
# reg_disp = regional.apply(disparity)
# Print the disparity of 'United States', 'United Kingdom', and 'China'
# print(reg_disp.loc[['United States','United Kingdom','China'], :])
def c_deck_survival(gr):
    # Flag passengers whose cabin starts with 'C' (missing cabins count as False)
    c_passengers = gr['cabin'].str.startswith('C').fillna(False)
    return gr.loc[c_passengers, 'survived'].mean()
# Create a groupby object using titanic over the 'sex' column: by_sex
by_sex = titanic.groupby("sex")
# Call by_sex.apply with the function c_deck_survival and print the result
c_surv_by_sex = by_sex.apply(c_deck_survival)
# Print the survival rates
print(c_surv_by_sex)
# NEED FILE!
# Read the CSV file into a DataFrame: sales
# sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)
# Group sales by 'Company': by_company
# by_company = sales.groupby("Company")
# Compute the sum of the 'Units' of by_company: by_com_sum
# by_com_sum = by_company["Units"].sum()
# print(by_com_sum)
# Filter 'Units' where the sum is > 35: by_com_filt
# by_com_filt = by_company.filter(lambda g:g['Units'].sum() > 35)
# print(by_com_filt)
# Create the Boolean Series: under10
under10 = (titanic['age'] < 10).map({True:'under 10', False:'over 10'})
# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)["survived"].mean()
print(survived_mean_1)
# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10, "pclass"])["survived"].mean()
print(survived_mean_2)
## pclass
## 1 216
## 2 184
## 3 491
## Name: survived, dtype: int64
## embarked pclass
## C 1 85
## 2 17
## 3 66
## Q 1 2
## 2 3
## 3 72
## S 1 127
## 2 164
## 3 353
## Name: survived, dtype: int64
## region
## America 74.037350
## East Asia & Pacific 73.405750
## Europe & Central Asia 75.656387
## Middle East & North Africa 72.805333
## South Asia 68.189750
## Sub-Saharan Africa 57.575080
## Name: 2010, dtype: float64
## pclass
## 1 80.0
## 2 70.0
## 3 74.0
## Name: (age, max), dtype: float64
## pclass
## 1 60.2875
## 2 14.2500
## 3 8.0500
## Name: (fare, median), dtype: float64
## id survived pclass name sex \
## 882 882 0 3 Markun, Mr. Johann male
## 883 883 0 3 Dahlberg, Miss. Gerda Ulrika female
## 884 884 0 2 Banfield, Mr. Frederick James male
## 885 885 0 3 Sutehall, Mr. Henry Jr male
## 886 886 0 3 Rice, Mrs. William (Margaret Norton) female
## 887 887 0 2 Montvila, Rev. Juozas male
## 888 888 1 1 Graham, Miss. Margaret Edith female
## 889 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female
## 890 890 1 1 Behr, Mr. Karl Howell male
## 891 891 0 3 Dooley, Mr. Patrick male
##
## age sibsp parch ticket fare cabin embarked
## 882 33.0 0 0 349257 7.8958 NaN S
## 883 22.0 0 0 7552 10.5167 NaN S
## 884 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
## 885 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
## 886 39.0 0 5 382652 29.1250 NaN Q
## 887 27.0 0 0 211536 13.0000 NaN S
## 888 19.0 0 0 112053 30.0000 B42 S
## 889 21.5 1 2 W./C. 6607 23.4500 NaN S
## 890 26.0 0 0 111369 30.0000 C148 C
## 891 32.0 0 0 370376 7.7500 NaN Q
## sex
## female 0.888889
## male 0.343750
## dtype: float64
## age
## over 10 0.366707
## under 10 0.612903
## Name: survived, dtype: float64
## age pclass
## over 10 1 0.629108
## 2 0.419162
## 3 0.222717
## under 10 1 0.666667
## 2 1.000000
## 3 0.452381
## Name: survived, dtype: float64
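Several of the groupby exercises above are commented out because the gapminder and sales files were not available. The sketch below reruns the same three ideas (dictionary aggregation with a custom function, transform, and filter) on a small made-up table. The DataFrame `gm` and all of its values are assumptions for illustration only and are not the course data.

```python
import pandas as pd

# Hypothetical stand-in for the gapminder data (all values are made up)
gm = pd.DataFrame({
    "region": ["Europe", "Europe", "Asia", "Asia", "Asia"],
    "population": [10, 20, 30, 40, 50],
    "child_mortality": [4.0, 6.0, 10.0, 20.0, 30.0],
    "gdp": [100.0, 300.0, 50.0, 150.0, 400.0],
})

def spread(series):
    """Return the range (max minus min) of a Series."""
    return series.max() - series.min()

# One aggregation per column, mixing built-in names and a custom function
aggregator = {"population": "sum", "child_mortality": "mean", "gdp": spread}
aggregated = gm.groupby("region").agg(aggregator)
print(aggregated)

# transform() returns a result aligned to the original rows (here a z-score per region)
gm["gdp_z"] = gm.groupby("region")["gdp"].transform(lambda s: (s - s.mean()) / s.std())
print(gm)

# filter() keeps or drops whole groups based on a group-level condition
big_regions = gm.groupby("region").filter(lambda g: g["population"].sum() > 35)
print(big_regions)
```

In this toy table only the Asia rows survive the filter, since the Europe populations sum to 30.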
Chapter 5 - Case Study (Summer Olympics)
Introduction to the Summer Olympics data and analysis objectives:
Understanding the column labels - looking at the Gender and event_gender columns to understand how they are different:
Constructing alternative country rankings:
Reshaping DataFrames for visualization:
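As a rough illustration of the alternative-ranking idea, the sketch below counts medals per edition and country with a pivot table and then uses idxmax to find the leading country in each edition. The table `toy` is a made-up miniature, not the Guardian data, and the athlete names are placeholders.

```python
import pandas as pd

# Hypothetical miniature medals table (not the Guardian data)
toy = pd.DataFrame({
    "Edition": [1952, 1952, 1952, 1956, 1956, 1956, 1956],
    "NOC":     ["USA", "USA", "URS", "URS", "URS", "USA", "URS"],
    "Athlete": ["a", "b", "c", "d", "e", "f", "g"],
})

# Rows are editions, columns are countries, cells count medal-winning athletes
won = toy.pivot_table(index="Edition", columns="NOC", values="Athlete", aggfunc="count")
print(won)

# idxmax along the columns returns the label of the largest column in each row
leader = won.idxmax(axis="columns")
print(leader)

# Counting the labels gives the number of editions each country led
print(leader.value_counts())
```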
Example code includes:
myPath = "./PythonInputFiles/"
import pandas as pd
import matplotlib.pyplot as plt
# Data is from https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data
# medals is 29216x10 with ['City', 'Edition', 'Sport', 'Discipline', 'Athlete', 'NOC', 'Gender', 'Event', 'Event_gender', 'Medal']
# Downloaded file from Guardian as myPath + "summerOlympics_Medalists_1896_2008.csv" - read file in
medals = pd.read_csv(myPath + "summerOlympics_Medalists_1896_2008.csv", header=4)
USA_edition_grouped = medals.loc[medals.NOC == 'USA'].groupby('Edition')
# Select the 'NOC' column of medals: country_names
country_names = medals["NOC"]
# Count the number of medals won by each country: medal_counts
medal_counts = country_names.value_counts()
# Print top 15 countries ranked by medals
print(medal_counts.head(15))
# Construct the pivot table: counted
counted = medals.pivot_table(index="NOC", columns="Medal", values="Athlete", aggfunc="count")
# Create the new column: counted['totals']
counted['totals'] = counted.sum(axis="columns")
# Sort counted by the 'totals' column
counted = counted.sort_values("totals", ascending=False)
# Print the top 15 rows of counted
print(counted.head(15))
# Select columns: ev_gen
ev_gen = medals[["Event_gender", "Gender"]]
# Drop duplicate pairs: ev_gen_uniques
ev_gen_uniques = ev_gen.drop_duplicates()
# Print ev_gen_uniques
print(ev_gen_uniques)
# Group medals by the two columns: medals_by_gender
medals_by_gender = medals.groupby(['Event_gender', 'Gender'])
# Create a DataFrame with a group count: medal_count_by_gender
medal_count_by_gender = medals_by_gender.count()
# Print medal_count_by_gender
print(medal_count_by_gender)
# Create the Boolean Series: sus
sus = (medals.Event_gender == 'W') & (medals.Gender == 'Men')
# Create a DataFrame with the suspicious row: suspect
suspect = medals.loc[sus, :]
# Print suspect
print(suspect)
# Group medals by 'NOC': country_grouped
country_grouped = medals.groupby("NOC")
# Compute the number of distinct sports in which each country won medals: Nsports
Nsports = country_grouped["Sport"].nunique()
# Sort the values of Nsports in descending order
Nsports = Nsports.sort_values(ascending=False)
# Print the top 15 rows of Nsports
print(Nsports.head(15))
# Create a Boolean Series flagging rows where 'Edition' is between 1952 and 1988: during_cold_war
during_cold_war = (medals["Edition"] >= 1952) & (medals["Edition"] <= 1988)
# Create a Boolean Series flagging rows where 'NOC' is either 'USA' or 'URS': is_usa_urs
is_usa_urs = medals.NOC.isin(["USA", "URS"])
# Use during_cold_war and is_usa_urs to create the DataFrame: cold_war_medals
cold_war_medals = medals.loc[during_cold_war & is_usa_urs]
# Group cold_war_medals by 'NOC'
country_grouped = cold_war_medals.groupby("NOC")
# Create Nsports
Nsports = country_grouped["Sport"].nunique().sort_values(ascending=False)
# Print Nsports
print(Nsports)
# Create the pivot table: medals_won_by_country
medals_won_by_country = medals.pivot_table(index="Edition", columns="NOC", values="Athlete", aggfunc="count")
# Slice medals_won_by_country: cold_war_usa_usr_medals
cold_war_usa_usr_medals = medals_won_by_country.loc[1952:1988, ["USA", "URS"]]
# Create most_medals
most_medals = cold_war_usa_usr_medals.idxmax(axis="columns")
# Print most_medals.value_counts()
print(most_medals.value_counts())
# Create the DataFrame: usa
usa = medals.loc[medals["NOC"] == "USA"]
# Group usa by ['Edition', 'Medal'] and aggregate over 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])["Athlete"].count()
# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level="Medal")
# Plot the DataFrame usa_medals_by_year
usa_medals_by_year.plot()
# plt.show()
plt.savefig("_dummyPy070.png", bbox_inches="tight")
plt.clf()
# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']
# Group usa by ['Edition', 'Medal'] and count the 'Athlete' column
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()
# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')
# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area()
# plt.show()
plt.savefig("_dummyPy071.png", bbox_inches="tight")
plt.clf()
# Redefine 'Medal' as an ordered categorical
medals.Medal = pd.Categorical(values=medals.Medal, categories=['Bronze', 'Silver', 'Gold'], ordered=True)
# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']
# Group usa by ['Edition', 'Medal'] and count the 'Athlete' column
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()
# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')
# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area()
# plt.show()
plt.savefig("_dummyPy072.png", bbox_inches="tight")
plt.clf()
## USA 4335
## URS 2049
## GBR 1594
## FRA 1314
## ITA 1228
## GER 1211
## AUS 1075
## HUN 1053
## SWE 1021
## GDR 825
## NED 782
## JPN 704
## CHN 679
## RUS 638
## ROU 624
## Name: NOC, dtype: int64
## Medal Bronze Gold Silver totals
## NOC
## USA 1052.0 2088.0 1195.0 4335.0
## URS 584.0 838.0 627.0 2049.0
## GBR 505.0 498.0 591.0 1594.0
## FRA 475.0 378.0 461.0 1314.0
## ITA 374.0 460.0 394.0 1228.0
## GER 454.0 407.0 350.0 1211.0
## AUS 413.0 293.0 369.0 1075.0
## HUN 345.0 400.0 308.0 1053.0
## SWE 325.0 347.0 349.0 1021.0
## GDR 225.0 329.0 271.0 825.0
## NED 320.0 212.0 250.0 782.0
## JPN 270.0 206.0 228.0 704.0
## CHN 193.0 234.0 252.0 679.0
## RUS 240.0 192.0 206.0 638.0
## ROU 282.0 155.0 187.0 624.0
## Event_gender Gender
## 0 M Men
## 348 X Men
## 416 W Women
## 639 X Women
## 23675 W Men
## City Edition Sport Discipline Athlete NOC Event \
## Event_gender Gender
## M Men 20067 20067 20067 20067 20067 20067 20067
## W Men 1 1 1 1 1 1 1
## Women 7277 7277 7277 7277 7277 7277 7277
## X Men 1653 1653 1653 1653 1653 1653 1653
## Women 218 218 218 218 218 218 218
##
## Medal
## Event_gender Gender
## M Men 20067
## W Men 1
## Women 7277
## X Men 1653
## Women 218
## City Edition Sport Discipline Athlete NOC Gender \
## 23675 Sydney 2000 Athletics Athletics CHEPCHUMBA, Joyce KEN Men
##
## Event Event_gender Medal
## 23675 marathon W Bronze
## NOC
## USA 34
## GBR 31
## FRA 28
## GER 26
## CHN 24
## AUS 22
## ESP 22
## CAN 22
## SWE 21
## URS 21
## ITA 21
## NED 20
## RUS 20
## JPN 20
## DEN 19
## Name: Sport, dtype: int64
## NOC
## URS 21
## USA 20
## Name: Sport, dtype: int64
## URS 8
## USA 2
## dtype: int64
Summer Olympics - USA Medals (line plot):
Summer Olympics - USA Medals (area plot):
Summer Olympics - USA Medals (area plot, medals ordered Bronze, Silver, Gold):
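The last plot differs from the second only because 'Medal' was redefined as an ordered categorical before grouping. A minimal sketch of that effect, on a made-up table rather than the real medals data, is shown here. Passing `observed=False` explicitly is a choice made for this sketch so that the behavior does not depend on the pandas version's default.

```python
import pandas as pd

# Hypothetical miniature medals table (not the Guardian data)
toy = pd.DataFrame({
    "Edition": [2000, 2000, 2000, 2004, 2004, 2004],
    "Medal":   ["Gold", "Silver", "Bronze", "Gold", "Gold", "Bronze"],
    "Athlete": ["a", "b", "c", "d", "e", "f"],
})

# Plain strings sort alphabetically, so the columns come out Bronze, Gold, Silver
plain = toy.groupby(["Edition", "Medal"])["Athlete"].count().unstack(level="Medal")

# An ordered categorical sorts by the stated category order instead
toy["Medal"] = pd.Categorical(toy["Medal"], categories=["Bronze", "Silver", "Gold"], ordered=True)
ordered = (
    toy.groupby(["Edition", "Medal"], observed=False)["Athlete"]
    .count()
    .unstack(level="Medal")
)

print(list(plain.columns))
print(list(ordered.columns))
```

The column order is what drives the stacking order in an area plot, which is why the categorical version stacks Bronze at the bottom and Gold at the top.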
Chapter 1 - Preparing data
Reading multiple data files - many tools such as pd.read_csv(), pd.read_excel(), pd.read_html(), pd.read_json():
Reindexing DataFrames - essential for combining DataFrames, since indices are the means by which DataFrames are combined:
Arithmetic with Series and DataFrames - generally, scalar operations can be broadcast in Python:
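A small sketch of why the index matters for combining data: reindexing aligns a Series to a new set of labels, and arithmetic between two Series matches values by label rather than by position. The Series names and values below are made up for illustration.

```python
import pandas as pd

# Hypothetical quarterly readings, indexed by month abbreviation
quarterly = pd.Series([32.0, 62.0, 69.0, 43.0], index=["Jan", "Apr", "Jul", "Oct"])
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# reindex() aligns to the new labels; labels missing from the source become NaN
full_year = quarterly.reindex(months)

# ffill() then carries the last seen value forward into the gaps
filled = full_year.ffill()

# Arithmetic between Series aligns on index labels, not on position;
# labels present in only one operand produce NaN
other = pd.Series([1.0, 1.0], index=["Apr", "Jan"])
total = quarterly + other

print(filled)
print(total)
```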
Example code includes:
myPath = "./PythonInputFiles/"
# Import pandas
import pandas as pd
medals = pd.read_csv(myPath + "summerOlympics_Medalists_1896_2008.csv", header=4)
# Read 'Bronze.csv' into a DataFrame: bronze
# bronze = pd.read_csv("Bronze.csv")
bronze = medals.loc[medals["Medal"] == "Bronze"]
# Read 'Silver.csv' into a DataFrame: silver
# silver = pd.read_csv("Silver.csv")
silver = medals.loc[medals["Medal"] == "Silver"]
# Read 'Gold.csv' into a DataFrame: gold
# gold = pd.read_csv("Gold.csv")
gold = medals.loc[medals["Medal"] == "Gold"]
# Print the first five rows of gold
print(gold.head())
bronze.to_csv(myPath + "olymBronze.csv", index=False)
silver.to_csv(myPath + "olymSilver.csv", index=False)
gold.to_csv(myPath + "olymGold.csv", index=False)
# One time only - for use in next section
# bronze[["NOC", "Athlete"]].groupby("NOC").count().sort_values("Athlete", ascending=False).iloc[0:5, :].to_csv(myPath + "bronze_top5.csv")
# silver[["NOC", "Athlete"]].groupby("NOC").count().sort_values("Athlete", ascending=False).iloc[0:5, :].to_csv(myPath + "silver_top5.csv")
# gold[["NOC", "Athlete"]].groupby("NOC").count().sort_values("Athlete", ascending=False).iloc[0:5, :].to_csv(myPath + "gold_top5.csv")
# Create the list of file names: filenames
filenames = ['olymGold.csv', 'olymSilver.csv', 'olymBronze.csv']
# Create the list of three DataFrames: dataframes
dataframes = []
for filename in filenames:
    dataframes.append(pd.read_csv(myPath + filename, encoding="latin-1"))
# Print top 5 rows of 1st DataFrame in dataframes
print(dataframes[0].head())
uqNOC = set(list(gold["NOC"].unique()) + list(silver["NOC"].unique()) + list(bronze["NOC"].unique()))
totGold = gold["NOC"].value_counts()
totSilver = silver["NOC"].value_counts()
totBronze = bronze["NOC"].value_counts()
totDF = pd.DataFrame( {"Gold":totGold, "Silver":totSilver, "Bronze":totBronze} ).fillna(0)
totDF["Total"] = totDF["Gold"] + totDF["Silver"] + totDF["Bronze"]
totDF = totDF[["Total", "Gold", "Silver", "Bronze"]]
totDF = totDF.sort_values("Total", ascending=False)
print(totDF.head(20))
# The sole variable is called "Max TemperatureF" with the index being called "Month"
maxTemps = [68, 60, 68, 84, 88, 89, 91, 86, 90, 84, 72, 68]
maxIndex = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# Read 'monthly_max_temp.csv' into a DataFrame: weather1
# weather1 = pd.read_csv('monthly_max_temp.csv', index_col="Month")
weather1 = pd.DataFrame( {"Max TemperatureF":maxTemps}, index=maxIndex )
# Print the head of weather1
print(weather1.head())
# Sort the index of weather1 in alphabetical order: weather2
weather2 = weather1.sort_index()
# Print the head of weather2
print(weather2.head())
# Sort the index of weather1 in reverse alphabetical order: weather3
weather3 = weather1.sort_index(ascending=False)
# Print the head of weather3
print(weather3.head())
# Sort weather1 numerically using the values of 'Max TemperatureF': weather4
weather4 = weather1.sort_values("Max TemperatureF")
# Print the head of weather4
print(weather4.head())
# The variable is called "Mean TemperatureF" and the indexing is run by "Month"
# The dataset is then called weather1
meanTemps = [61.956043956043956, 32.133333333333333, 68.934782608695656, 43.434782608695649]
meanIndex = ["Apr", "Jan", "Jul", "Oct"]
year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
weather1 = pd.DataFrame( {"Mean TemperatureF":meanTemps}, index=meanIndex )
print(weather1.head())
# Reindex weather1 using the list year: weather2
weather2 = weather1.reindex(year)
# Print weather2
print(weather2)
# Reindex weather1 using the list year with forward-fill: weather3
weather3 = weather1.reindex(year).ffill()
# Print weather3
print(weather3)
# Baby names data is from https://www.data.gov/developers/baby-names-dataset/
yob1881 = pd.read_csv(myPath + "yob1881.txt", header=None)
yob1981 = pd.read_csv(myPath + "yob1981.txt", header=None)
yob1881.columns = ["Name", "Gender", "Count"]
yob1981.columns = ["Name", "Gender", "Count"]
yob1881 = yob1881.set_index("Name").sort_values("Count", ascending=False)
yob1981 = yob1981.set_index("Name").sort_values("Count", ascending=False)
print(yob1881.shape)
print(yob1981.shape)
print(yob1881.head(12))
print(yob1981.head(12))
# Reindex names_1981 with index of names_1881: common_names
# Take only top-200 names by year
pop1881 = yob1881.iloc[0:200, :]
pop1981 = yob1981.iloc[0:200, :]
common_names = pop1981.reindex(pop1881.index)
# Print shape of common_names
print(common_names.shape)
print(common_names.head(12))
# Drop rows with null counts: common_names
common_names = common_names.dropna()
# Print shape of new common_names
print(common_names.shape)
print(common_names.head(12))
# weather is 365x22 representing 2013 Pittsburgh weather data from Weather Underground
# Used package "weatherData" to grab this from R
# KPIT2013 <- weatherData::getWeatherForDate("KPIT", "2013-01-01", "2013-12-31", opt_all_columns = TRUE)
# write.csv(KPIT2013, "./PythonInputFiles/KPIT2013.csv", row.names=FALSE)
weather = pd.read_csv(myPath + "KPIT2013.csv")
# Extract selected columns from weather as new DataFrame: temps_f
temps_f = weather[['Min_TemperatureF', 'Mean_TemperatureF', 'Max_TemperatureF']]
# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * (5/9)
# Rename 'F' in column names with 'C': temps_c.columns
temps_c.columns = temps_c.columns.str.replace("F", "C")
# Print first 5 rows of temps_c
print(temps_c.head())
# Quarterly US GDP data from 1947-01-01 to 2016-04-01
# Downloaded from https://fred.stlouisfed.org/series/GDP as myPath + "US_GDP_1947_2016_StLouisFRED.csv"
# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv(myPath + "US_GDP_1947_2016_StLouisFRED.csv", parse_dates=True, index_col="DATE")
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc["2008-01-01":, :]
# Print the last 8 rows of post2008
print(post2008.tail(8))
# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample("A").last()
# Print yearly
print(yearly)
# Compute percentage growth of yearly: yearly['growth']
yearly['growth'] = yearly.pct_change()*100
# Print yearly again
print(yearly)
# Import pandas
# import pandas as pd
# Read 'sp500.csv' into a DataFrame: sp500
# sp500 = pd.read_csv("sp500.csv", parse_dates=True, index_col="Date")
# Read 'exchange.csv' into a DataFrame: exchange
# exchange = pd.read_csv("exchange.csv", parse_dates=True, index_col="Date")
# Subset 'Open' & 'Close' columns from sp500: dollars
# dollars = sp500.loc[:, ["Open", "Close"]]
# Print the head of dollars
# print(dollars.head())
# Convert dollars to pounds: pounds
# pounds = dollars.multiply(exchange["GBP/USD"], axis="rows")
# Print the head of pounds
# print(pounds.head())
## City Edition Sport Discipline Athlete NOC Gender \
## 0 Athens 1896 Aquatics Swimming HAJOS, Alfred HUN Men
## 3 Athens 1896 Aquatics Swimming MALOKINIS, Ioannis GRE Men
## 6 Athens 1896 Aquatics Swimming HAJOS, Alfred HUN Men
## 9 Athens 1896 Aquatics Swimming NEUMANN, Paul AUT Men
## 13 Athens 1896 Athletics Athletics BURKE, Thomas USA Men
##
## Event Event_gender Medal
## 0 100m freestyle M Gold
## 3 100m freestyle for sailors M Gold
## 6 1200m freestyle M Gold
## 9 400m freestyle M Gold
## 13 100m M Gold
## City Edition Sport Discipline Athlete NOC Gender \
## 0 Athens 1896 Aquatics Swimming HAJOS, Alfred HUN Men
## 1 Athens 1896 Aquatics Swimming MALOKINIS, Ioannis GRE Men
## 2 Athens 1896 Aquatics Swimming HAJOS, Alfred HUN Men
## 3 Athens 1896 Aquatics Swimming NEUMANN, Paul AUT Men
## 4 Athens 1896 Athletics Athletics BURKE, Thomas USA Men
##
## Event Event_gender Medal
## 0 100m freestyle M Gold
## 1 100m freestyle for sailors M Gold
## 2 1200m freestyle M Gold
## 3 400m freestyle M Gold
## 4 100m M Gold
## Total Gold Silver Bronze
## USA 4335.0 2088.0 1195.0 1052.0
## URS 2049.0 838.0 627.0 584.0
## GBR 1594.0 498.0 591.0 505.0
## FRA 1314.0 378.0 461.0 475.0
## ITA 1228.0 460.0 394.0 374.0
## GER 1211.0 407.0 350.0 454.0
## AUS 1075.0 293.0 369.0 413.0
## HUN 1053.0 400.0 308.0 345.0
## SWE 1021.0 347.0 349.0 325.0
## GDR 825.0 329.0 271.0 225.0
## NED 782.0 212.0 250.0 320.0
## JPN 704.0 206.0 228.0 270.0
## CHN 679.0 234.0 252.0 193.0
## RUS 638.0 192.0 206.0 240.0
## ROU 624.0 155.0 187.0 282.0
## CAN 592.0 154.0 211.0 227.0
## NOR 537.0 194.0 199.0 144.0
## POL 499.0 103.0 173.0 223.0
## DEN 491.0 147.0 192.0 152.0
## FRG 490.0 143.0 167.0 180.0
## Max TemperatureF
## Jan 68
## Feb 60
## Mar 68
## Apr 84
## May 88
## Max TemperatureF
## Apr 84
## Aug 86
## Dec 68
## Feb 60
## Jan 68
## Max TemperatureF
## Sep 90
## Oct 84
## Nov 72
## May 88
## Mar 68
## Max TemperatureF
## Feb 60
## Jan 68
## Mar 68
## Dec 68
## Nov 72
## Mean TemperatureF
## Apr 61.956044
## Jan 32.133333
## Jul 68.934783
## Oct 43.434783
## Mean TemperatureF
## Jan 32.133333
## Feb NaN
## Mar NaN
## Apr 61.956044
## May NaN
## Jun NaN
## Jul 68.934783
## Aug NaN
## Sep NaN
## Oct 43.434783
## Nov NaN
## Dec NaN
## Mean TemperatureF
## Jan 32.133333
## Feb 32.133333
## Mar 32.133333
## Apr 61.956044
## May 61.956044
## Jun 61.956044
## Jul 68.934783
## Aug 68.934783
## Sep 68.934783
## Oct 43.434783
## Nov 43.434783
## Dec 43.434783
## (1935, 2)
## (19471, 2)
## Gender Count
## Name
## John M 8769
## William M 8524
## Mary F 6919
## James M 5441
## George M 4664
## Charles M 4636
## Frank M 2834
## Anna F 2698
## Joseph M 2456
## Henry M 2339
## Thomas M 2282
## Edward M 2177
## Gender Count
## Name
## Michael M 68765
## Jennifer F 57046
## Christopher M 50228
## Matthew M 43324
## Jessica F 42530
## Jason M 41926
## David M 40647
## Joshua M 39054
## James M 38307
## John M 34881
## Robert M 34396
## Amanda F 34372
## (200, 2)
## Gender Count
## Name
## John M 34881.0
## William M 24803.0
## Mary F 11040.0
## James M 38307.0
## George M 5159.0
## Charles M 14428.0
## Frank M 3637.0
## Anna F 5189.0
## Joseph M 30771.0
## Henry NaN NaN
## Thomas M 17165.0
## Edward M 6657.0
## (42, 2)
## Gender Count
## Name
## John M 34881.0
## William M 24803.0
## Mary F 11040.0
## James M 38307.0
## George M 5159.0
## Charles M 14428.0
## Frank M 3637.0
## Anna F 5189.0
## Joseph M 30771.0
## Thomas M 17165.0
## Edward M 6657.0
## Robert M 34396.0
## Min_TemperatureC Mean_TemperatureC Max_TemperatureC
## 0 -6.111111 -2.777778 0.000000
## 1 -10.000000 -6.666667 -3.888889
## 2 -14.444444 -6.666667 0.555556
## 3 -3.333333 -1.666667 0.000000
## 4 -4.444444 -1.111111 1.666667
## GDP
## DATE
## 2015-04-01 17998.3
## 2015-07-01 18141.9
## 2015-10-01 18222.8
## 2016-01-01 18281.6
## 2016-04-01 18450.1
## 2016-07-01 18675.3
## 2016-10-01 18869.4
## 2017-01-01 19027.6
## GDP
## DATE
## 2008-12-31 14549.9
## 2009-12-31 14566.5
## 2010-12-31 15230.2
## 2011-12-31 15785.3
## 2012-12-31 16297.3
## 2013-12-31 16999.9
## 2014-12-31 17692.2
## 2015-12-31 18222.8
## 2016-12-31 18869.4
## 2017-12-31 19027.6
## GDP growth
## DATE
## 2008-12-31 14549.9 NaN
## 2009-12-31 14566.5 0.114090
## 2010-12-31 15230.2 4.556345
## 2011-12-31 15785.3 3.644732
## 2012-12-31 16297.3 3.243524
## 2013-12-31 16999.9 4.311144
## 2014-12-31 17692.2 4.072377
## 2015-12-31 18222.8 2.999062
## 2016-12-31 18869.4 3.548302
## 2017-12-31 19027.6 0.838394
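The sp500 and exchange-rate exercise above is commented out because the files were not available. The sketch below shows the same broadcasting pattern on made-up numbers: a scalar applies to every cell, while a Series passed to `multiply` is matched against the row index. The names `dollars` and `rate` and all values are assumptions, and `axis="index"` is used here as the documented spelling of the row axis.

```python
import pandas as pd

dates = pd.date_range("2015-01-01", periods=3, freq="D")

# Hypothetical prices in dollars and a hypothetical daily exchange rate
dollars = pd.DataFrame({"Open": [100.0, 110.0, 120.0],
                        "Close": [105.0, 115.0, 125.0]}, index=dates)
rate = pd.Series([0.5, 0.6, 0.7], index=dates, name="GBP/USD")

# Scalars broadcast to every cell
doubled = dollars * 2

# multiply(..., axis="index") matches the Series to the row index,
# so each row is scaled by that day's rate
pounds = dollars.multiply(rate, axis="index")

# pct_change() compares each row with the previous one; the first row is NaN
growth = dollars["Close"].pct_change() * 100

print(pounds)
print(growth)
```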
Chapter 2 - Concatenating Data
Appending and concatenating Series - using .append() or pd.concat():
Appending and concatenating DataFrames:
Concatenation, keys, and MultiIndexes:
Outer and Inner Joins:
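A compact sketch of the keys and join options, using pd.concat throughout. The two tables are trimmed, illustrative versions of the top-medal tables (three rows each) and should be read as assumptions rather than as the saved CSV files.

```python
import pandas as pd

# Hypothetical trimmed medal tables, indexed by country code
bronze = pd.DataFrame({"Total": [1052, 584, 505]}, index=["USA", "URS", "GBR"])
gold = pd.DataFrame({"Total": [2088, 838, 460]}, index=["USA", "URS", "ITA"])

# Vertical stacking; keys= adds an outer index level naming each piece
stacked = pd.concat([bronze, gold], keys=["bronze", "gold"])
print(stacked.loc[("gold", "URS"), "Total"])

# Horizontal concatenation aligns on the index.
# join="outer" (the default) keeps every label and fills gaps with NaN
outer = pd.concat([bronze, gold], axis=1, keys=["bronze", "gold"])

# join="inner" keeps only the labels shared by all pieces
inner = pd.concat([bronze, gold], axis=1, keys=["bronze", "gold"], join="inner")

print(outer)
print(inner)
```

With these inputs the outer join has four rows (GBR and ITA each have one missing value) and the inner join keeps only USA and URS.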
Example code includes:
myPath = "./PythonInputFiles/"
import pandas as pd
import numpy as np
import random
# Do not have these .csv files
# Created dummy data and saved .csv to myPath
# keyDates = pd.date_range("2015-01-01", "2015-03-31")
# utHardware = [random.randint(2, 10) for p in range(len(keyDates))]
# utSoftware = [random.randint(1, 50) for p in range(len(keyDates))]
# utService = [random.randint(0, 200) for p in range(len(keyDates))]
# totSales = pd.DataFrame( {"Date":[str(x).split()[0] for x in keyDates], "Hardware":utHardware, "Software":utSoftware, "Service":utService } )
# totSales["Units"] = totSales["Hardware"] + totSales["Software"] + totSales["Service"]
# totSales["Company"] = ["A", "B", "C"] * 30
# totSales.iloc[:31, :].to_csv(myPath + "sales-jan-2015.csv", index=False)
# totSales.iloc[31:59, :].to_csv(myPath + "sales-feb-2015.csv", index=False)
# totSales.iloc[59:, :].to_csv(myPath + "sales-mar-2015.csv", index=False)
# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv(myPath + "sales-jan-2015.csv", parse_dates=True, index_col="Date")
# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv(myPath + "sales-feb-2015.csv", parse_dates=True, index_col="Date")
# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv(myPath + "sales-mar-2015.csv", parse_dates=True, index_col="Date")
# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']
# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']
# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']
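# Append feb_units and then mar_units to jan_units: quarter1
# (Series.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat() is the replacement)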
quarter1 = jan_units.append(feb_units).append(mar_units)
# Print the first slice from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
# Print the second slice from quarter1
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])
# Compute & print total sales in quarter1
print(quarter1.sum())
# Initialize empty list: units
units = []
# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month["Units"])
# Concatenate the list: quarter1
quarter1 = pd.concat(units, axis="rows")
# Print slices from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])
# Refers back to the names datasets from earlier in these chapters
yob1881 = pd.read_csv(myPath + "yob1881.txt", header=None)
yob1981 = pd.read_csv(myPath + "yob1981.txt", header=None)
yob1881.columns = ["Name", "Gender", "Count"]
yob1981.columns = ["Name", "Gender", "Count"]
names_1881 = yob1881.sort_values("Count", ascending=False)
names_1981 = yob1981.sort_values("Count", ascending=False)
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981
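# Append names_1981 after names_1881 with ignore_index=True: combined_names
# (DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat() is the replacement)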
combined_names = names_1881.append(names_1981, ignore_index=True)
# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)
# Print all rows that contain the name 'Morgan'
print(combined_names.loc[combined_names["Name"].str.contains("Morgan"), :])
# These data reuse the temperature values from earlier in this workbook (as built here, Max is 12x1 and Mean is 4x1)
# The sole variable is called "Max TemperatureF" with the index being called "Month"
maxTemps = [68, 60, 68, 84, 88, 89, 91, 86, 90, 84, 72, 68]
maxIndex = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
meanTemps = [61.956043956043956, 32.133333333333333, 68.934782608695656, 43.434782608695649]
meanIndex = ["Apr", "Jan", "Jul", "Oct"]
weather_max = pd.DataFrame( {"Max TemperatureF":maxTemps}, index=maxIndex)
weather_mean = pd.DataFrame( {"Mean TemperatureF":meanTemps}, index=meanIndex)
# Concatenate weather_max and weather_mean horizontally: weather
weather = pd.concat([weather_max, weather_mean], axis=1).reindex(weather_max.index)
# Print weather
print(weather)
# This uses the Olympics medal datasets from previous
medal_types = ['bronze', 'silver', 'gold']
medals = []
for medal in medal_types:
    # Create the file name: file_name
    file_name = myPath + "%s_top5.csv" % medal  # The % operator substitutes the value of medal for %s
    # Create list of column names: columns
    columns = ['Country', medal]
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, header=0, index_col="Country", names=columns)
    # Append medal_df to medals
    medals.append(medal_df)
# Concatenate medals horizontally: medals
medals = pd.concat(medals, axis="columns")
# Print medals
print(medals)
medals = []
for medal in medal_types:
    file_name = myPath + "%s_top5.csv" % medal
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, index_col="NOC")
    # Append medal_df to medals
    medals.append(medal_df)
# Concatenate medals: medals
medals = pd.concat(medals, keys=['bronze', 'silver', 'gold'])
# Print medals in entirety
print(medals)
# Sort the entries of medals: medals_sorted
medals_sorted = medals.sort_index(level=0)
# Print the number of Bronze medals won by Germany
print(medals_sorted.loc[('bronze','GER')])
# Print data about silver medals
print(medals_sorted.loc['silver'])
# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice
# Print all the data on medals won by the United Kingdom
print(medals_sorted.loc[idx[:,'GBR'], :])
# DO NOT HAVE THESE FILES - PROBABLY LINKED TO THE "sales" INPUTS FROM ABOVE
# Concatenate dataframes: february
# february = pd.concat(dataframes, axis=1, keys=['Hardware', 'Software', 'Service'])
# Print february.info()
# print(february.info())
# Assign pd.IndexSlice: idx
# idx = pd.IndexSlice
# Create the slice: slice_2_8
# slice_2_8 = february.loc['2015-02-02':'2015-02-08', idx[:, 'Company']]
# Print slice_2_8
# print(slice_2_8)
# CONTINUES TO BE jan/feb/mar FROM PREVIOUS "sales" INPUTS
# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]
# Create an empty dictionary: month_dict
month_dict = {}
for month_name, month_data in month_list:
    # Group month_data: month_dict[month_name]
    month_dict[month_name] = month_data.groupby("Company").sum()
# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)
# Print sales
print(sales)
# Print all sales by 'A'
idx = pd.IndexSlice
print(sales.loc[idx[:, 'A'], :])
# Again, the Olympics datasets (specifically, top-5 by medal type)
bronze_top5=pd.read_csv(myPath + "bronze_top5.csv", index_col="NOC")
silver_top5=pd.read_csv(myPath + "silver_top5.csv", index_col="NOC")
gold_top5=pd.read_csv(myPath + "gold_top5.csv", index_col="NOC")
# Create the list of DataFrames: medal_list
medal_list = [bronze_top5, silver_top5, gold_top5]
# Concatenate medal_list horizontally using an inner join: medals
medals = pd.concat(medal_list, axis=1, join="inner", keys=['bronze', 'silver', 'gold'])
medals.columns = ['bronze', 'silver', 'gold']
# Print medals
print(medals)
# US is quarterly GDP starting 1947
# China is annual GDP starting 1966
# Resample and tidy china: china_annual
# china_annual = china.resample("A").pct_change(10).dropna()
# Resample and tidy us: us_annual
# us_annual = us.resample("A").pct_change(10).dropna()
# Concatenate china_annual and us_annual: gdp
# gdp = pd.concat([china_annual, us_annual], join="inner", axis=1)
# Resample gdp and print
# print(gdp.resample('10A').last())
## Date
## 2015-01-27 200
## 2015-01-28 223
## 2015-01-29 176
## 2015-01-30 124
## 2015-01-31 116
## 2015-02-01 116
## 2015-02-02 168
## Name: Units, dtype: int64
## Date
## 2015-02-26 234
## 2015-02-27 203
## 2015-02-28 118
## 2015-03-01 136
## 2015-03-02 31
## 2015-03-03 191
## 2015-03-04 80
## 2015-03-05 38
## 2015-03-06 111
## 2015-03-07 129
## Name: Units, dtype: int64
## 11979
## Date
## 2015-01-27 200
## 2015-01-28 223
## 2015-01-29 176
## 2015-01-30 124
## 2015-01-31 116
## 2015-02-01 116
## 2015-02-02 168
## Name: Units, dtype: int64
## Date
## 2015-02-26 234
## 2015-02-27 203
## 2015-02-28 118
## 2015-03-01 136
## 2015-03-02 31
## 2015-03-03 191
## 2015-03-04 80
## 2015-03-05 38
## 2015-03-06 111
## 2015-03-07 129
## Name: Units, dtype: int64
## (19471, 4)
## (1935, 4)
## (21406, 4)
## Name Gender Count year
## 680 Morgan M 23 1881
## 2249 Morgan F 1769 1981
## 2521 Morgan M 766 1981
## 10117 Morgana F 14 1981
## 13078 Morgann F 9 1981
## 19844 Morganne F 5 1981
## Max TemperatureF Mean TemperatureF
## Jan 68 32.133333
## Feb 60 NaN
## Mar 68 NaN
## Apr 84 61.956044
## May 88 NaN
## Jun 89 NaN
## Jul 91 68.934783
## Aug 86 NaN
## Sep 90 NaN
## Oct 84 43.434783
## Nov 72 NaN
## Dec 68 NaN
## bronze silver gold
## FRA 475.0 461.0 NaN
## GBR 505.0 591.0 498.0
## GER 454.0 NaN 407.0
## ITA NaN 394.0 460.0
## URS 584.0 627.0 838.0
## USA 1052.0 1195.0 2088.0
## Athlete
## NOC
## bronze USA 1052
## URS 584
## GBR 505
## FRA 475
## GER 454
## silver USA 1195
## URS 627
## GBR 591
## FRA 461
## ITA 394
## gold USA 2088
## URS 838
## GBR 498
## ITA 460
## GER 407
## Athlete 454
## Name: (bronze, GER), dtype: int64
## Athlete
## NOC
## FRA 461
## GBR 591
## ITA 394
## URS 627
## USA 1195
## Athlete
## NOC
## bronze GBR 505
## gold GBR 498
## silver GBR 591
## Hardware Service Software Units
## Company
## february A 47 986 210 1243
## B 70 1092 242 1404
## C 41 966 189 1196
## january A 72 1133 252 1457
## B 68 1117 188 1373
## C 50 1037 277 1364
## march A 66 667 247 980
## B 56 1137 303 1496
## C 65 1139 262 1466
## Hardware Service Software Units
## Company
## february A 47 986 210 1243
## january A 72 1133 252 1457
## march A 66 667 247 980
## bronze silver gold
## NOC
## USA 1052 1195 2088
## URS 584 627 838
## GBR 505 591 498
Chapter 3 - Merging Data
Merging DataFrames - an extension of concatenation that allows merging on columns rather than only on the index:
Joining DataFrames - various types of joins, and their implications for processing efficiency:
Ordered merges - DataFrames where the underlying data has a natural order (such as time series data):
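As a quick illustration of the join types listed above, here is a minimal sketch on two made-up tables (the frames `left` and `right` and their values are assumptions for illustration, not course data). It counts the rows each `how=` option keeps and uses `indicator=True` to show which side each row came from.

```python
import pandas as pd

# Two tiny made-up tables (not the course data) sharing a 'branch_id' key
left = pd.DataFrame({"branch_id": [10, 20, 30], "revenue": [100, 83, 4]})
right = pd.DataFrame({"branch_id": [10, 20, 31], "manager": ["Charles", "Joel", "Sally"]})

# Try each join type and record how many rows survive
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, on="branch_id", how=how)
    row_counts[how] = len(merged)
    print(how, len(merged))

# indicator=True adds a '_merge' column saying where each row came from
outer = pd.merge(left, right, on="branch_id", how="outer", indicator=True)
print(outer)
```

Only the keys present in both tables survive an inner join, while an outer join keeps every key and fills the missing side with NaN.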
Example code includes:
myPath = "./PythonInputFiles/"
import pandas as pd
revenue = pd.DataFrame({"branch_id" : [10, 20, 30, 47] , "city" : ["Austin", "Denver", "Springfield", "Mendocino"] , "revenue" : [100, 83, 4, 200] } )
managers = pd.DataFrame({"branch_id" : [10, 20, 47, 31] , "city" : ["Austin", "Denver", "Mendocino", "Springfield"] , "manager" : ["Charles", "Joel", "Brett", "Sally"] } )
# Merge revenue with managers on 'city': merge_by_city
merge_by_city = pd.merge(revenue, managers, on="city")
# Print merge_by_city
print(merge_by_city)
# Merge revenue with managers on 'branch_id': merge_by_id
merge_by_id = pd.merge(revenue, managers, on="branch_id")
# Print merge_by_id
print(merge_by_id)
revenue["state"] = ["TX", "CO", "IL", "CA"]
managers["state"] = ["TX", "CO", "CA", "MO"]
managers=managers.iloc[:, [1, 0, 2, 3]]
managers.columns = ["branch", "branch_id", "manager", "state"]
# Merge revenue & managers on 'city' & 'branch': combined
combined = pd.merge(revenue, managers, left_on="city", right_on="branch")
# Print combined
print(combined)
# Add 'state' column to revenue: revenue['state']
# revenue['state'] = ['TX','CO','IL','CA'] # already handled above
# Add 'state' column to managers: managers['state']
# managers['state'] = ['TX','CO','CA','MO'] # already handled above
managers = managers.iloc[:, [1, 0, 2, 3]] # get back to how it was
managers.columns = ["branch_id", "city", "manager", "state"]
# Merge revenue & managers on 'branch_id', 'city', & 'state': combined
combined = pd.merge(revenue, managers, on=["branch_id", "city", "state"])
# Print combined
print(combined)
sales = pd.DataFrame( { "city" : ["Mendocino", "Denver", "Austin", "Springield", "Springfield"] , "state" : ["CA", "CO", "TX", "MO", "IL"] , "units" : [1, 4, 2, 5, 1] } )
managers=managers.iloc[:, [1, 0, 2, 3]]
managers.columns = ["branch", "branch_id", "manager", "state"]
# Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue, sales, how="right", on=['city', 'state'])
# Print revenue_and_sales
print(revenue_and_sales)
# Merge sales and managers: sales_and_managers
sales_and_managers = pd.merge(sales, managers, how="left", left_on=['city', 'state'], right_on=['branch', 'state'])
# Print sales_and_managers
print(sales_and_managers)
# Perform the first merge: merge_default
merge_default = pd.merge(sales_and_managers, revenue_and_sales)
# Print merge_default
print(merge_default)
# Perform the second merge: merge_outer
merge_outer = pd.merge(sales_and_managers, revenue_and_sales, how="outer")
# Print merge_outer
print(merge_outer)
# Perform the third merge: merge_outer_on
merge_outer_on = pd.merge(sales_and_managers, revenue_and_sales, on=['city','state'], how="outer")
# Print merge_outer_on
print(merge_outer_on)
austin = pd.DataFrame( { "date":pd.to_datetime(["2016-01-01", "2016-02-08", "2016-01-17"]), "ratings" : ["Cloudy", "Cloudy", "Sunny"] } )
houston = pd.DataFrame( { "date":pd.to_datetime(["2016-01-04", "2016-01-01", "2016-03-01"]), "ratings" : ["Rainy", "Cloudy", "Sunny"] } )
# Perform the first ordered merge: tx_weather
tx_weather = pd.merge_ordered(austin, houston)
# Print tx_weather
print(tx_weather)
# Perform the second ordered merge: tx_weather_suff
tx_weather_suff = pd.merge_ordered(austin, houston, on="date", suffixes=['_aus','_hus'])
# Print tx_weather_suff
print(tx_weather_suff)
# Perform the third ordered merge: tx_weather_ffill
tx_weather_ffill = pd.merge_ordered(austin, houston, on="date", suffixes=['_aus','_hus'], fill_method="ffill")
# Print tx_weather_ffill
print(tx_weather_ffill)
# Similar to pd.merge_ordered(), the pd.merge_asof() function merges values in order using the 'on' column, but for each row in the left DataFrame it keeps only the last row from the right DataFrame whose 'on' value is less than or equal to the left value (the default direction="backward").
# DO NOT HAVE THESE DATASETS
# Merge auto and oil: merged
# merged = pd.merge_asof(auto, oil, left_on="yr", right_on="Date")
# Print the tail of merged
# print(merged.tail())
# Resample merged: yearly
# yearly = merged.resample("A", on="Date")[['mpg','Price']].mean()
# Print yearly
# print(yearly)
# Print the correlation matrix of yearly
# print(yearly.corr())
## branch_id_x city revenue branch_id_y manager
## 0 10 Austin 100 10 Charles
## 1 20 Denver 83 20 Joel
## 2 30 Springfield 4 31 Sally
## 3 47 Mendocino 200 47 Brett
## branch_id city_x revenue city_y manager
## 0 10 Austin 100 Austin Charles
## 1 20 Denver 83 Denver Joel
## 2 47 Mendocino 200 Mendocino Brett
## branch_id_x city revenue state_x branch branch_id_y \
## 0 10 Austin 100 TX Austin 10
## 1 20 Denver 83 CO Denver 20
## 2 30 Springfield 4 IL Springfield 31
## 3 47 Mendocino 200 CA Mendocino 47
##
## manager state_y
## 0 Charles TX
## 1 Joel CO
## 2 Sally MO
## 3 Brett CA
## branch_id city revenue state manager
## 0 10 Austin 100 TX Charles
## 1 20 Denver 83 CO Joel
## 2 47 Mendocino 200 CA Brett
## branch_id city revenue state units
## 0 10.0 Austin 100.0 TX 2
## 1 20.0 Denver 83.0 CO 4
## 2 30.0 Springfield 4.0 IL 1
## 3 47.0 Mendocino 200.0 CA 1
## 4 NaN Springield NaN MO 5
## city state units branch branch_id manager
## 0 Mendocino CA 1 Mendocino 47.0 Brett
## 1 Denver CO 4 Denver 20.0 Joel
## 2 Austin TX 2 Austin 10.0 Charles
## 3 Springield MO 5 NaN NaN NaN
## 4 Springfield IL 1 NaN NaN NaN
## city state units branch branch_id manager revenue
## 0 Mendocino CA 1 Mendocino 47.0 Brett 200.0
## 1 Denver CO 4 Denver 20.0 Joel 83.0
## 2 Austin TX 2 Austin 10.0 Charles 100.0
## 3 Springield MO 5 NaN NaN NaN NaN
## city state units branch branch_id manager revenue
## 0 Mendocino CA 1 Mendocino 47.0 Brett 200.0
## 1 Denver CO 4 Denver 20.0 Joel 83.0
## 2 Austin TX 2 Austin 10.0 Charles 100.0
## 3 Springield MO 5 NaN NaN NaN NaN
## 4 Springfield IL 1 NaN NaN NaN NaN
## 5 Springfield IL 1 NaN 30.0 NaN 4.0
## city state units_x branch branch_id_x manager branch_id_y \
## 0 Mendocino CA 1 Mendocino 47.0 Brett 47.0
## 1 Denver CO 4 Denver 20.0 Joel 20.0
## 2 Austin TX 2 Austin 10.0 Charles 10.0
## 3 Springield MO 5 NaN NaN NaN NaN
## 4 Springfield IL 1 NaN NaN NaN 30.0
##
## revenue units_y
## 0 200.0 1
## 1 83.0 4
## 2 100.0 2
## 3 NaN 5
## 4 4.0 1
## date ratings
## 0 2016-01-01 Cloudy
## 1 2016-01-04 Rainy
## 2 2016-01-17 Sunny
## 3 2016-02-08 Cloudy
## 4 2016-03-01 Sunny
## date ratings_aus ratings_hus
## 0 2016-01-01 Cloudy Cloudy
## 1 2016-01-04 NaN Rainy
## 2 2016-01-17 Sunny NaN
## 3 2016-02-08 Cloudy NaN
## 4 2016-03-01 NaN Sunny
## date ratings_aus ratings_hus
## 0 2016-01-01 Cloudy Cloudy
## 1 2016-01-04 Cloudy Rainy
## 2 2016-01-17 Sunny Rainy
## 3 2016-02-08 Cloudy Rainy
## 4 2016-03-01 Cloudy Sunny
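Since the auto and oil datasets were not available, the pd.merge_asof() call could not be run above. The following is a minimal sketch of the same call on small made-up frames; the values in `auto` and `oil` are hypothetical stand-ins, and only the column names are borrowed from the commented-out code.

```python
import pandas as pd

# Hypothetical stand-ins for the unavailable 'auto' and 'oil' datasets
auto = pd.DataFrame({
    "yr": pd.to_datetime(["1970-01-01", "1971-01-01", "1972-01-01"]),
    "mpg": [18.0, 25.0, 19.0],
})
oil = pd.DataFrame({
    "Date": pd.to_datetime(["1969-12-01", "1970-01-01", "1970-06-01", "1971-12-15"]),
    "Price": [3.30, 3.35, 3.40, 3.56],
})

# Both frames must already be sorted by their key columns.
# For each 'yr', take the last 'oil' row whose Date is on or before it.
merged = pd.merge_asof(auto, oil, left_on="yr", right_on="Date")
print(merged)
```

The first row matches the oil price dated exactly 1970-01-01, because the default backward search allows exact matches; the other rows pick up the most recent earlier price.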
Chapter 4 - Case Study (Summer Olympics)
Medals in the Summer Olympics - does a country win more medals when it is the host?:
Quantifying Performance:
Reshaping and plotting:
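The "Quantifying Performance" step below relies on .expanding().mean() followed by .pct_change(). A minimal sketch of what those two calls do, using a made-up Series of medal shares (the numbers are assumptions chosen for easy arithmetic, not Olympic data):

```python
import pandas as pd

# Made-up medal shares for one country across four editions
fractions = pd.Series([0.125, 0.25, 0.375, 0.75], index=[1996, 2000, 2004, 2008])

# Expanding window: each value is the mean of all rows up to and including it
mean_fractions = fractions.expanding().mean()
print(mean_fractions)

# Percentage change of that running mean from one edition to the next
fractions_change = mean_fractions.pct_change() * 100
print(fractions_change)
```

The first percentage change is NaN because there is no earlier row to compare against, which matches the all-NaN 1896 row in the real output.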
Example code includes:
myPath = "./PythonInputFiles/"
import pandas as pd
import matplotlib.pyplot as plt
# Create files needed for reading in later
# medals = pd.read_csv(myPath + "summerOlympics_Medalists_1896_2008.csv", header=4)
# uqYears = medals["Edition"].value_counts().sort_index().index
# for x in uqYears:
#     outFile = myPath + '_notuse_summer_{:d}.csv'.format(x)
#     outData = medals.loc[medals["Edition"] == x]
#     outData.to_csv(outFile, index=False)
#
# Create file path: file_path
file_path = myPath + "summerOlympics_Hosts_1896_2008.txt"
# Load DataFrame from file_path: editions
editions = pd.read_csv(file_path, sep="\t")
# Extract the relevant columns: editions
editions = editions[['Edition', 'Grand Total', 'City', 'Country']]
# Print editions DataFrame
print(editions)
# Create the file path: file_path
file_path = myPath + 'olympicsCountryCodes.csv'
# Load DataFrame from file_path: ioc_codes
ioc_codes = pd.read_csv(file_path)
ioc_codes.columns = ["Country", "NOC", "ISO", "Country_1"]
# Extract the relevant columns: ioc_codes
ioc_codes = ioc_codes[["Country", "NOC"]]
# Print first and last 5 rows of ioc_codes
print(ioc_codes.head())
print(ioc_codes.tail())
# Create empty dictionary: medals_dict
medals_dict = {}
for year in editions['Edition']:
    # Create the file path: file_path
    file_path = myPath + '_notuse_summer_{:d}.csv'.format(year)
    # Load file_path into a DataFrame: medals_dict[year]
    medals_dict[year] = pd.read_csv(file_path, encoding="latin-1")
    # Extract relevant columns: medals_dict[year]
    medals_dict[year] = medals_dict[year][['Athlete', 'NOC', 'Medal']]
    # Assign year to column 'Edition' of medals_dict
    medals_dict[year]['Edition'] = year
# Concatenate medals_dict: medals
medals = pd.concat(medals_dict, ignore_index=True)
# Print first and last 5 rows of medals
print(medals.head())
print(medals.tail())
# Construct the pivot_table: medal_counts
medal_counts = medals.pivot_table(index="Edition", columns="NOC", values="Athlete", aggfunc="count")
# Print the first & last 5 rows of medal_counts
print(medal_counts.head())
print(medal_counts.tail())
# Set Index of editions: totals
totals = editions.set_index("Edition")
# Reassign totals['Grand Total']: totals
totals = totals["Grand Total"]
# Divide medal_counts by totals: fractions
fractions = medal_counts.divide(totals, axis="rows")
# Print first & last 5 rows of fractions
print(fractions.head())
print(fractions.tail())
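# .expanding() creates a window that grows from the first row to the current row, so .mean() gives the running (cumulative) mean of each country's medal fraction through each edition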
# Apply the expanding mean: mean_fractions
mean_fractions = fractions.expanding().mean()
# Compute the percentage change: fractions_change
fractions_change = mean_fractions.pct_change() * 100
# Reset the index of fractions_change: fractions_change
fractions_change = fractions_change.reset_index()
# Print first & last 5 rows of fractions_change
print(fractions_change.head())
print(fractions_change.tail())
# Left join editions and ioc_codes: hosts
hosts = pd.merge(editions, ioc_codes, how="left")
# Extract relevant columns and set index: hosts
hosts = hosts[["Edition", "NOC"]].set_index("Edition")
# Fix missing 'NOC' values of hosts
print(hosts.loc[hosts.NOC.isnull()])
hosts.loc[1972, 'NOC'] = 'FRG'
hosts.loc[1980, 'NOC'] = 'URS'
hosts.loc[1988, 'NOC'] = 'KOR'
# Reset Index of hosts: hosts
hosts = hosts.reset_index()
# Print hosts
print(hosts)
# Reshape fractions_change: reshaped
reshaped = pd.melt(fractions_change, id_vars="Edition", value_name="Change")
# Print reshaped.shape and fractions_change.shape
print(reshaped.shape, fractions_change.shape)
# Extract rows from reshaped where 'NOC' == 'CHN': chn
chn = reshaped[reshaped["NOC"] == "CHN"]
# Print last 5 rows of chn with .tail()
print(chn.tail())
# Merge reshaped and hosts: merged
merged = pd.merge(reshaped, hosts, how="inner")
# Print first 5 rows of merged
print(merged.head())
# Set Index of merged and sort it: influence
influence = merged.set_index("Edition").sort_index()
# Print first 5 rows of influence
print(influence.head())
# Import pyplot
import matplotlib.pyplot as plt
# Extract influence['Change']: change
change = influence["Change"]
# Make bar plot of change: ax
ax = change.plot(kind="bar")
# Customize the plot to improve readability
ax.set_ylabel("% Change of Host Country Medal Count")
ax.set_title("Is there a Host Country Advantage?")
ax.set_xticklabels(editions['City'])
# Display the plot
# plt.show()
plt.savefig("_dummyPy073.png", bbox_inches="tight")
plt.clf()
## Edition Grand Total City Country
## 0 1896 151 Athens Greece
## 1 1900 512 Paris France
## 2 1904 470 St. Louis United States
## 3 1908 804 London United Kingdom
## 4 1912 885 Stockholm Sweden
## 5 1920 1298 Antwerp Belgium
## 6 1924 884 Paris France
## 7 1928 710 Amsterdam Netherlands
## 8 1932 615 Los Angeles United States
## 9 1936 875 Berlin Germany
## 10 1948 814 London United Kingdom
## 11 1952 889 Helsinki Finland
## 12 1956 885 Melbourne Australia
## 13 1960 882 Rome Italy
## 14 1964 1010 Tokyo Japan
## 15 1968 1031 Mexico City Mexico
## 16 1972 1185 Munich West Germany (now Germany)
## 17 1976 1305 Montreal Canada
## 18 1980 1387 Moscow U.S.S.R. (now Russia)
## 19 1984 1459 Los Angeles United States
## 20 1988 1546 Seoul South Korea
## 21 1992 1705 Barcelona Spain
## 22 1996 1859 Atlanta United States
## 23 2000 2015 Sydney Australia
## 24 2004 1998 Athens Greece
## 25 2008 2042 Beijing China
## Country NOC
## 0 Afghanistan AFG
## 1 Albania ALB
## 2 Algeria ALG
## 3 American Samoa* ASA
## 4 Andorra AND
## Country NOC
## 196 Vietnam VIE
## 197 Virgin Islands* ISV
## 198 Yemen YEM
## 199 Zambia ZAM
## 200 Zimbabwe ZIM
## Athlete NOC Medal Edition
## 0 HAJOS, Alfred HUN Gold 1896
## 1 HERSCHMANN, Otto AUT Silver 1896
## 2 DRIVAS, Dimitrios GRE Bronze 1896
## 3 MALOKINIS, Ioannis GRE Gold 1896
## 4 CHASAPIS, Spiridon GRE Silver 1896
## Athlete NOC Medal Edition
## 29211 ENGLICH, Mirko GER Silver 2008
## 29212 MIZGAITIS, Mindaugas LTU Bronze 2008
## 29213 PATRIKEEV, Yuri ARM Bronze 2008
## 29214 LOPEZ, Mijain CUB Gold 2008
## 29215 BAROEV, Khasan RUS Silver 2008
## NOC AFG AHO ALG ANZ ARG ARM AUS AUT AZE BAH ... URS URU \
## Edition ...
## 1896 NaN NaN NaN NaN NaN NaN 2.0 5.0 NaN NaN ... NaN NaN
## 1900 NaN NaN NaN NaN NaN NaN 5.0 6.0 NaN NaN ... NaN NaN
## 1904 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN ... NaN NaN
## 1908 NaN NaN NaN 19.0 NaN NaN NaN 1.0 NaN NaN ... NaN NaN
## 1912 NaN NaN NaN 10.0 NaN NaN NaN 14.0 NaN NaN ... NaN NaN
##
## NOC USA UZB VEN VIE YUG ZAM ZIM ZZX
## Edition
## 1896 20.0 NaN NaN NaN NaN NaN NaN 6.0
## 1900 55.0 NaN NaN NaN NaN NaN NaN 34.0
## 1904 394.0 NaN NaN NaN NaN NaN NaN 8.0
## 1908 63.0 NaN NaN NaN NaN NaN NaN NaN
## 1912 101.0 NaN NaN NaN NaN NaN NaN NaN
##
## [5 rows x 138 columns]
## NOC AFG AHO ALG ANZ ARG ARM AUS AUT AZE BAH ... URS URU \
## Edition ...
## 1992 NaN NaN 2.0 NaN 2.0 NaN 57.0 6.0 NaN 1.0 ... NaN NaN
## 1996 NaN NaN 3.0 NaN 20.0 2.0 132.0 3.0 1.0 5.0 ... NaN NaN
## 2000 NaN NaN 5.0 NaN 20.0 1.0 183.0 4.0 3.0 6.0 ... NaN 1.0
## 2004 NaN NaN NaN NaN 47.0 NaN 157.0 8.0 5.0 2.0 ... NaN NaN
## 2008 1.0 NaN 2.0 NaN 51.0 6.0 149.0 3.0 7.0 5.0 ... NaN NaN
##
## NOC USA UZB VEN VIE YUG ZAM ZIM ZZX
## Edition
## 1992 224.0 NaN NaN NaN NaN NaN NaN NaN
## 1996 260.0 2.0 NaN NaN 26.0 1.0 NaN NaN
## 2000 248.0 4.0 NaN 1.0 26.0 NaN NaN NaN
## 2004 264.0 5.0 2.0 NaN NaN NaN 3.0 NaN
## 2008 315.0 6.0 1.0 1.0 NaN NaN 4.0 NaN
##
## [5 rows x 138 columns]
## NOC AFG AHO ALG ANZ ARG ARM AUS AUT AZE BAH \
## Edition
## 1896 NaN NaN NaN NaN NaN NaN 0.013245 0.033113 NaN NaN
## 1900 NaN NaN NaN NaN NaN NaN 0.009766 0.011719 NaN NaN
## 1904 NaN NaN NaN NaN NaN NaN NaN 0.002128 NaN NaN
## 1908 NaN NaN NaN 0.023632 NaN NaN NaN 0.001244 NaN NaN
## 1912 NaN NaN NaN 0.011299 NaN NaN NaN 0.015819 NaN NaN
##
## NOC ... URS URU USA UZB VEN VIE YUG ZAM ZIM ZZX
## Edition ...
## 1896 ... NaN NaN 0.132450 NaN NaN NaN NaN NaN NaN 0.039735
## 1900 ... NaN NaN 0.107422 NaN NaN NaN NaN NaN NaN 0.066406
## 1904 ... NaN NaN 0.838298 NaN NaN NaN NaN NaN NaN 0.017021
## 1908 ... NaN NaN 0.078358 NaN NaN NaN NaN NaN NaN NaN
## 1912 ... NaN NaN 0.114124 NaN NaN NaN NaN NaN NaN NaN
##
## [5 rows x 138 columns]
## NOC AFG AHO ALG ANZ ARG ARM AUS AUT \
## Edition
## 1992 NaN NaN 0.001173 NaN 0.001173 NaN 0.033431 0.003519
## 1996 NaN NaN 0.001614 NaN 0.010758 0.001076 0.071006 0.001614
## 2000 NaN NaN 0.002481 NaN 0.009926 0.000496 0.090819 0.001985
## 2004 NaN NaN NaN NaN 0.023524 NaN 0.078579 0.004004
## 2008 0.00049 NaN 0.000979 NaN 0.024976 0.002938 0.072968 0.001469
##
## NOC AZE BAH ... URS URU USA UZB VEN \
## Edition ...
## 1992 NaN 0.000587 ... NaN NaN 0.131378 NaN NaN
## 1996 0.000538 0.002690 ... NaN NaN 0.139860 0.001076 NaN
## 2000 0.001489 0.002978 ... NaN 0.000496 0.123077 0.001985 NaN
## 2004 0.002503 0.001001 ... NaN NaN 0.132132 0.002503 0.001001
## 2008 0.003428 0.002449 ... NaN NaN 0.154261 0.002938 0.000490
##
## NOC VIE YUG ZAM ZIM ZZX
## Edition
## 1992 NaN NaN NaN NaN NaN
## 1996 NaN 0.013986 0.000538 NaN NaN
## 2000 0.000496 0.012903 NaN NaN NaN
## 2004 NaN NaN NaN 0.001502 NaN
## 2008 0.000490 NaN NaN 0.001959 NaN
##
## [5 rows x 138 columns]
## NOC Edition AFG AHO ALG ANZ ARG ARM AUS AUT AZE \
## 0 1896 NaN NaN NaN NaN NaN NaN NaN NaN NaN
## 1 1900 NaN NaN NaN NaN NaN NaN -13.134766 -32.304688 NaN
## 2 1904 NaN NaN NaN NaN NaN NaN 0.000000 -30.169386 NaN
## 3 1908 NaN NaN NaN NaN NaN NaN 0.000000 -23.013510 NaN
## 4 1912 NaN NaN NaN -26.092774 NaN NaN 0.000000 6.254438 NaN
##
## NOC ... URS URU USA UZB VEN VIE YUG ZAM ZIM ZZX
## 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## 1 ... NaN NaN -9.448242 NaN NaN NaN NaN NaN NaN 33.561198
## 2 ... NaN NaN 199.651245 NaN NaN NaN NaN NaN NaN -22.642384
## 3 ... NaN NaN -19.549222 NaN NaN NaN NaN NaN NaN 0.000000
## 4 ... NaN NaN -12.105733 NaN NaN NaN NaN NaN NaN 0.000000
##
## [5 rows x 139 columns]
## NOC Edition AFG AHO ALG ANZ ARG ARM AUS \
## 21 1992 NaN 0.0 -7.214076 0.0 -6.767308 NaN 2.754114
## 22 1996 NaN 0.0 8.959211 0.0 1.306696 NaN 10.743275
## 23 2000 NaN 0.0 19.762488 0.0 0.515190 -26.935484 12.554986
## 24 2004 NaN 0.0 0.000000 0.0 9.625365 0.000000 8.161162
## 25 2008 NaN 0.0 -8.197807 0.0 8.588555 91.266408 6.086870
##
## NOC AUT AZE ... URS URU USA UZB VEN \
## 21 -3.034840 NaN ... 0.0 0.000000 -1.329330 NaN 0.000000
## 22 -3.876773 NaN ... 0.0 0.000000 -1.010378 NaN 0.000000
## 23 -3.464221 88.387097 ... 0.0 -12.025323 -1.341842 42.258065 0.000000
## 24 -2.186922 48.982144 ... 0.0 0.000000 -1.031922 21.170339 -1.615969
## 25 -3.389836 31.764436 ... 0.0 0.000000 -0.450031 14.610625 -6.987342
##
## NOC VIE YUG ZAM ZIM ZZX
## 21 NaN 0.000000 0.000000 0.000000 0.0
## 22 NaN -2.667732 -10.758472 0.000000 0.0
## 23 NaN -2.696445 0.000000 0.000000 0.0
## 24 0.000000 0.000000 0.000000 -43.491929 0.0
## 25 -0.661117 0.000000 0.000000 -23.316533 0.0
##
## [5 rows x 139 columns]
## NOC
## Edition
## 1972 NaN
## 1980 NaN
## 1988 NaN
## Edition NOC
## 0 1896 GRE
## 1 1900 FRA
## 2 1904 USA
## 3 1908 GBR
## 4 1912 SWE
## 5 1920 BEL
## 6 1924 FRA
## 7 1928 NED
## 8 1932 USA
## 9 1936 GER
## 10 1948 GBR
## 11 1952 FIN
## 12 1956 AUS
## 13 1960 ITA
## 14 1964 JPN
## 15 1968 MEX
## 16 1972 FRG
## 17 1976 CAN
## 18 1980 URS
## 19 1984 USA
## 20 1988 KOR
## 21 1992 ESP
## 22 1996 USA
## 23 2000 AUS
## 24 2004 GRE
## 25 2008 CHN
## (3588, 3) (26, 139)
## Edition NOC Change
## 567 1992 CHN 4.240630
## 568 1996 CHN 7.860247
## 569 2000 CHN -3.851278
## 570 2004 CHN 0.128863
## 571 2008 CHN 13.251332
## Edition NOC Change
## 0 1956 AUS 54.615063
## 1 2000 AUS 12.554986
## 2 1920 BEL 54.757887
## 3 1976 CAN -2.143977
## 4 2008 CHN 13.251332
## NOC Change
## Edition
## 1896 GRE NaN
## 1900 FRA 198.002486
## 1904 USA 199.651245
## 1908 GBR 134.489218
## 1912 SWE 71.896226
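The reshaping step above turned the (26, 139) fractions_change table into a (3588, 3) long table: 26 editions times 138 country columns gives 3588 rows. A minimal sketch of the same pd.melt() pattern on a made-up wide table (the values are assumptions; `var_name` is passed explicitly here, whereas the real table takes the name "NOC" from its column index):

```python
import pandas as pd

# Made-up wide table: one row per Edition, one column per country code
wide = pd.DataFrame({
    "Edition": [2000, 2004, 2008],
    "CHN": [1.0, 2.0, 3.0],
    "USA": [4.0, 5.0, 6.0],
})

# id_vars stays as a column; every other column is stacked into
# (NOC, Change) pairs, one row per Edition/country combination
long = pd.melt(wide, id_vars="Edition", var_name="NOC", value_name="Change")
print(long)
print(wide.shape, long.shape)
```

Filtering the long table by country, as done for CHN above, then becomes a plain row selection such as `long[long["NOC"] == "CHN"]`.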
Summer Olympics - % Change in Medals (Host Country):
Chapter 1 - Basics of Relational Databases